the objects present in the heap at the end of a garbage collection
(this is called a heap shot and is currently implemented only for
the sgen garbage collector).
+Another available profiler mode is the \f[I]sampling\f[] or
+\f[I]statistical\f[] mode: the program is sampled periodically and
+information about what the program was busy with is saved.
+This makes it possible to gather information about the program's
+behaviour without degrading its performance too much (the overhead
+is usually less than 10%).
.SS Basic profiler usage
.PP
The simplest way to use the profiler is the following:
.PP
You will still be able to inspect information about the sequence of
calls that lead to each allocation because at each object
-allocation a stack trace is collected as well.
+allocation a stack trace is collected if full enter/leave
+information is not available.
.PP
To periodically collect heap shots (and exclude method and
allocation events) use the following options (making sure you run
with the sgen garbage collector):
.PP
\f[B]mono\ --gc=sgen\ --profile=log:heapshot\ program.exe\f[]
+.PP
+To perform a sampling profiler run, use the \f[I]sample\f[] option:
+.PP
+\f[B]mono\ --profile=log:sample\ program.exe\f[]
.SS Profiler option documentation
.PP
By default the \f[I]log\f[] profiler will gather all the events
\f[I]NUM\f[]ms: perform a heap shot if at least \f[I]NUM\f[]
milliseconds passed since the last one.
.IP \[bu] 2
-\f[I]NUM\f[]gc: perform a heap shot every \f[I]NUM\f[] garbage
-collections (either minor or major).
+\f[I]NUM\f[]gc: perform a heap shot every \f[I]NUM\f[] major
+garbage collections.
+.IP \[bu] 2
+\f[I]ondemand\f[]: perform a heap shot when such a command is sent
+to the control port.
+.RE
+.IP \[bu] 2
+\f[I]sample[=TYPE[/FREQ]]\f[]: collect statistical samples of the
+program behaviour.
+The default is to sample the instruction pointer 100 times per
+second (100 Hz).
+This is equivalent to the value \[lq]cycles/100\[rq] for
+\f[I]TYPE\f[].
+On some systems, such as those running recent Linux kernels, the
+sampling can be driven by other events provided by the CPU's
+performance counters.
+In this case, \f[I]TYPE\f[] can be one of:
+.RS 2
+.IP \[bu] 2
+\f[I]cycles\f[]: processor cycles
+.IP \[bu] 2
+\f[I]instr\f[]: executed instructions
+.IP \[bu] 2
+\f[I]cacherefs\f[]: cache references
+.IP \[bu] 2
+\f[I]cachemiss\f[]: cache misses
+.IP \[bu] 2
+\f[I]branches\f[]: executed branches
+.IP \[bu] 2
+\f[I]branchmiss\f[]: mispredicted branches
.RE
.IP \[bu] 2
\f[I]time=TIMER\f[]: use the TIMER timestamp mode.
collect at most \f[I]NUM\f[] frames.
The default is 8.
.IP \[bu] 2
+\f[I]maxsamples=NUM\f[]: stop allocating reusable sample events
+once \f[I]NUM\f[] events have been allocated (a value of zero
+effectively means unlimited). By default, the value of this setting
+is the number of CPU cores multiplied by 1000. This is usually a
+good enough value for typical desktop and mobile apps.
+If you're losing too many samples due to this default (which is
+possible in apps with an unusually high number of threads), you
+may want to tune this value to find a good balance between
+sample hit rate and performance impact on the app. The way it works
+is that sample events are enqueued for reuse after they're flushed
+to the output file; if a thread gets a sampling signal but there are
+no sample events in the reuse queue and the profiler has reached the
+maximum number of sample allocations, the sample is dropped. So a
+higher value for this setting increases the chance that a thread is
+able to collect a sample, but also necessarily means more work done
+by the profiler. You can run Mono with the \f[I]--stats\f[] option
+to see statistics about sample events.
+.IP \[bu] 2
\f[I]calldepth=NUM\f[]: ignore method enter/leave events when the
call chain depth is bigger than NUM.
.IP \[bu] 2
This is equivalent to the option: \f[B]output=mprof-report\ -\f[].
If the \f[I]output\f[] option is specified as well, the report will
be written to the output file instead of the console.
+.IP \[bu] 2
+\f[I]port=PORT\f[]: specify the tcp/ip port to use for the
+listening command server.
+Currently not available on Windows.
+This server is started, for example, when heapshot=ondemand is
+used; it reads commands line by line.
+The following commands are available:
+.RS 2
+.IP \[bu] 2
+\f[I]heapshot\f[]: perform a heapshot as soon as possible
+.RE
+.IP \[bu] 2
+\f[I]counters\f[]: sample counter values every second. This
+provides a very lightweight way to gain insight into some key
+runtime metrics. The counters displayed in non-verbose mode are:
+Methods from AOT, Methods JITted using mono JIT, Methods JITted
+using LLVM, Total time spent JITting (sec), User Time, System Time,
+Total Time, Working Set, Private Bytes, Virtual Bytes, Page Faults
+and CPU Load Average (1min, 5min and 15min).
+.RE
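+.PP
+Multiple profiler options can be combined with commas.
+For example, to sample mispredicted branches 1000 times per second
+while raising the cap on reusable sample events (the values here
+are chosen purely for illustration):
+.PP
+\f[B]mono\ --profile=log:sample=branchmiss/1000,maxsamples=8000\ program.exe\f[]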
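+.PP
+When the command server is enabled, commands can be sent with any
+TCP client.
+For example, assuming the server listens on a hypothetical port
+8888, a heap shot can be requested with netcat:
+.PP
+\f[B]mono\ --gc=sgen\ --profile=log:heapshot=ondemand,port=8888\ program.exe\f[]
+.PP
+\f[B]echo\ heapshot\ |\ nc\ localhost\ 8888\f[]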
.SS Analyzing the profile data
.PP
Currently there is a command line program (\f[I]mprof-report\f[])
.IP \[bu] 2
\f[I]bytes\f[]: the total number of bytes used by objects of the
given type
+.PP
+To change the sort order of counters, use the option:
+.PP
+\f[B]--counters-sort=MODE\f[]
+.PP
+where \f[I]MODE\f[] can be:
+.IP \[bu] 2
+\f[I]time\f[]: sort values by time then category
+.IP \[bu] 2
+\f[I]category\f[]: sort values by category then time
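+.PP
+For example, to sort counter samples by category first (here
+\f[I]output.mlpd\f[] is a hypothetical profiler data file):
+.PP
+\f[B]mprof-report\ --counters-sort=category\ output.mlpd\f[]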
.SS Selecting what data to report
.PP
The profiler by default collects data about many runtime subsystems
where the report names R1, R2 etc.
can be:
.IP \[bu] 2
+\f[I]header\f[]: information about program startup and profiler
+version
+.IP \[bu] 2
+\f[I]jit\f[]: JIT compiler information
+.IP \[bu] 2
+\f[I]sample\f[]: statistical sampling information
+.IP \[bu] 2
\f[I]gc\f[]: garbage collection information
.IP \[bu] 2
\f[I]alloc\f[]: object allocation information
.IP \[bu] 2
\f[I]thread\f[]: thread information
.IP \[bu] 2
+\f[I]domain\f[]: app domain information
+.IP \[bu] 2
+\f[I]context\f[]: remoting context information
+.IP \[bu] 2
\f[I]heapshot\f[]: live heap usage at heap shots
+.IP \[bu] 2
+\f[I]counters\f[]: counters samples
.PP
It is possible to limit some of the data displayed to a timeframe
of the program execution with the option:
.PP
will find all the byte arrays that are at least 10000 bytes in
size.
+.PP
+Note that with a moving garbage collector the object address can
+change, so you may need to track the changed address manually.
+It can also happen that multiple objects are allocated at the same
+address, so the output from this option can become large.
.SS Saving a profiler report
.PP
By default mprof-report will print the summary data to the console.
slower.
There are several ways to reduce the impact of the profiler on the
program execution.
+.SS Use the statistical sampling mode
+.PP
+Statistical sampling allows executing a program under the profiler
+with minimal performance overhead (usually less than 10%).
+This mode allows checking where the program is spending most of
+its execution time without significantly perturbing its behaviour.
.SS Collect less data
.PP
Collecting method enter/leave events can be very expensive,
up this operation, but, depending on the system, time accounting
may have some level of approximation (though statistically the data
should be still fairly valuable).
-.SS Use a statistical profiler instead
-.PP
-See the mono manpage for the use of a statistical (sampling)
-profiler.
-The \f[I]log\f[] profiler will be enhanced to provide sampling info
-in the future.
.SS Dealing with the size of the data files
.PP
When collecting a lot of information about a profiled program, huge
.PP
\f[B]output=|mprof-report\ --reports=monitor\ --traces\ -\f[]
.SH WEB SITE
-http://www.mono-project.com/Profiler
+http://www.mono-project.com/docs/debug+profile/profile/profiler/
.SH SEE ALSO
.PP
mono(1)