[profiler] Rewrite the log profiler's handling of sample hits.

The old code had quite a few problems:
1. It could corrupt the stack under heavy load, for reasons that are unclear.
This typically showed up as an assertion failure at the start of unwinding
because the unwinding flags were inconsistent.
2. It sometimes wrote complete garbage into the statistical buffers.
3. As a result of using the helper thread to assemble collected samples into log
buffers, it did not work on platforms where the helper thread is disabled.
4. Performance was poor under heavy load because every wakeup of the helper
thread to assemble events also triggered a counters sample and dump.
5. The code was very complicated and hard to understand/maintain.
(Points (1) and (2) can probably be attributed to point (5).)

In the new sampling world order, we use a lock-free allocator and a lock-free
reuse queue to acquire memory for sample hit events in the sample hit callback.
We then fill it with the info we collect during the async stack walk. Finally,
we ship it off to a separate dumper thread (via another lock-free queue) where
it gets turned into a log buffer and then shipped off to the writer thread.
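
The flow described above can be sketched roughly as follows. This is a minimal
illustration using C11 atomics, not the actual profiler code. All names here
(SampleHit, sample_hit, dumper_drain, and so on) are hypothetical, a fixed-size
arena with an atomic cursor stands in for the lock-free allocator, and
tagged-index LIFO stacks stand in for the lock-free queues.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_FRAMES 16
#define ARENA_SIZE 1024
#define NIL UINT32_MAX

/* Hypothetical sample hit record; the real event layout is an assumption. */
typedef struct {
    uint64_t thread_id;
    int frame_count;
    uintptr_t frames[MAX_FRAMES];
    _Atomic uint32_t next; /* index of the next record in the owning stack */
} SampleHit;

/* Stand-in for the lock-free allocator: a static arena plus a cursor. */
static SampleHit arena[ARENA_SIZE];
static _Atomic uint32_t arena_used;

/* The head packs (tag << 32 | index). The tag changes on every update, so a
 * stale compare-and-swap cannot succeed after the head was recycled (ABA). */
typedef struct { _Atomic uint64_t head; } IndexStack;
#define STACK_INIT { (uint64_t) NIL }

static IndexStack reuse_stack = STACK_INIT;   /* events ready for reuse */
static IndexStack pending_stack = STACK_INIT; /* events awaiting the dumper */

static void stack_push(IndexStack *s, uint32_t idx)
{
    uint64_t old = atomic_load(&s->head);
    uint64_t new_head;
    do {
        atomic_store(&arena[idx].next, (uint32_t) old);
        new_head = (((old >> 32) + 1) << 32) | idx;
    } while (!atomic_compare_exchange_weak(&s->head, &old, new_head));
}

static uint32_t stack_pop(IndexStack *s)
{
    uint64_t old = atomic_load(&s->head);
    for (;;) {
        uint32_t idx = (uint32_t) old;
        if (idx == NIL)
            return NIL;
        uint32_t next = atomic_load(&arena[idx].next);
        uint64_t new_head = (((old >> 32) + 1) << 32) | next;
        if (atomic_compare_exchange_weak(&s->head, &old, new_head))
            return idx;
    }
}

/* Prefer a recycled event; fall back to the arena only when none is free. */
static uint32_t sample_acquire(void)
{
    uint32_t idx = stack_pop(&reuse_stack);
    if (idx != NIL)
        return idx;
    uint32_t used = atomic_load(&arena_used);
    while (used < ARENA_SIZE) {
        if (atomic_compare_exchange_weak(&arena_used, &used, used + 1))
            return used;
    }
    return NIL; /* arena exhausted: the sample is dropped */
}

/* Called from the sampling callback: acquire, fill, hand off to the dumper. */
static int sample_hit(uint64_t thread_id, const uintptr_t *frames, int count)
{
    uint32_t idx = sample_acquire();
    if (idx == NIL)
        return 0;
    SampleHit *hit = &arena[idx];
    hit->thread_id = thread_id;
    hit->frame_count = count < MAX_FRAMES ? count : MAX_FRAMES;
    for (int i = 0; i < hit->frame_count; i++)
        hit->frames[i] = frames[i];
    stack_push(&pending_stack, idx);
    return 1;
}

/* Dumper thread body: emit each pending event, then recycle its memory.
 * `emit` stands in for turning the event into a log buffer. */
static int dumper_drain(void (*emit)(const SampleHit *))
{
    int drained = 0;
    uint32_t idx;
    while ((idx = stack_pop(&pending_stack)) != NIL) {
        if (emit)
            emit(&arena[idx]);
        stack_push(&reuse_stack, idx);
        drained++;
    }
    return drained;
}
```

Because the stacks are LIFO, this sketch hands events to the dumper newest
first; a real queue would preserve arrival order. Nothing in the sample path
takes a lock or calls malloc, which is the property that matters when the
caller is a sampling callback.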
This new approach appears very stable; I've yet to make it crash in any of the
stress tests I typically run. In a heavy workload, the helper thread's CPU
usage was previously around 25-30%. It is now down to basically 0%, while the
dumper thread uses only around 15%. CPU usage by threads that are collecting
samples is a bit higher than before, but only until the reuse queue converges
on a size that is big enough to satisfy all sample event requests. At that
point, no more allocations from the lock-free allocator happen, and performance
improves.
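
The convergence behaviour can be illustrated with a small counting model. This
is a hypothetical simplification, not profiler code: it tracks only how many
events exist and assumes the dumper returns every event to the reuse pool
between bursts of sample hits. PoolModel and pool_run_burst are names invented
for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the reuse pool: only sizes are tracked. */
typedef struct {
    size_t pooled;    /* events currently sitting in the reuse pool */
    size_t allocated; /* fresh allocations made so far */
} PoolModel;

/* Simulate `burst` sample hits arriving before the dumper recycles any event.
 * Returns how many fresh allocations the burst needed. */
static size_t pool_run_burst(PoolModel *p, size_t burst)
{
    size_t reused = burst < p->pooled ? burst : p->pooled;
    size_t fresh = burst - reused;

    p->allocated += fresh;
    /* The dumper processes all events of the burst and returns them. */
    p->pooled = p->pooled - reused + burst;
    return fresh;
}
```

In this model the pool only grows when a burst exceeds every earlier burst, so
once the largest burst has been seen, the number of fresh allocations per burst
stays at zero.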