<para>To use Callgrind, you must specify
<computeroutput>--tool=callgrind</computeroutput> on the Valgrind
command line or use the supplied script
<computeroutput>callgrind</computeroutput>.</para>
<sect2 id="cl-manual.functionality" xreflabel="Functionality">
<title>Functionality</title>

<para>Cachegrind collects flat profile data: event counts (data reads,
cache misses, etc.) are attributed directly to the function they
occurred in. This cost attribution mechanism is
called <emphasis>self</emphasis> or <emphasis>exclusive</emphasis>
attribution.</para>
<para>Callgrind extends this functionality by propagating costs
across function call boundaries. If function <code>foo</code> calls
<code>bar</code>, the costs from <code>bar</code> are added into
<code>foo</code>'s costs. Applied to the program as a whole,
this builds up a picture of so-called <emphasis>inclusive</emphasis>
costs, that is, where the cost of each function includes the costs of
all functions it called, directly or indirectly.</para>
<para>As an example, the inclusive cost of
<computeroutput>main</computeroutput> should be almost 100 percent
of the total program cost. Because of costs arising before
<computeroutput>main</computeroutput> is run, such as
initialization of the run time linker and construction of global C++
objects, the inclusive cost of <computeroutput>main</computeroutput>
is not exactly 100 percent of the total program cost.</para>
<para>Together with the call graph, this allows you to find the
specific call chains starting from
<computeroutput>main</computeroutput> in which the majority of the
program's costs occur. Caller/callee cost attribution is also useful
for profiling functions called from multiple call sites, and where
optimization opportunities depend on changing code in the callers, in
particular by reducing the call count.</para>
<para>Callgrind's cache simulation is based on the
<ulink url="&cg-tool-url;">Cachegrind tool</ulink> of the
<ulink url="&vg-url;">Valgrind</ulink> package. Read
<ulink url="&cg-doc-url;">Cachegrind's documentation</ulink> first.
The material below describes the features supported in addition to
Cachegrind's features.</para>

</sect2>

</sect1>
<sect1 id="cl-manual.purpose" xreflabel="Purpose">
<title>Purpose</title>

<sect2 id="cl-manual.devel"
       xreflabel="Profiling as part of Application Development">
<title>Profiling as part of Application Development</title>
<para>In application development, a common step is
to improve runtime performance. To avoid wasting time on
optimizing functions which are rarely used, one needs to know
in which parts of the program most of the time is spent.</para>
<para>This is done with a technique called profiling. The program
is run under the control of a profiling tool, which gives the time
distribution of executed functions in the run. After examination
of the program's profile, it should be clear if and where optimization
is useful. Afterwards, one should verify any runtime changes by another
profile run.</para>

</sect2>
<sect2 id="cl-manual.tools" xreflabel="Profiling Tools">
<title>Profiling Tools</title>
<para>Most widely known is the GCC profiling tool <command>GProf</command>:
one needs to compile an application with the compiler option
<computeroutput>-pg</computeroutput>. Running the program generates
a file <computeroutput>gmon.out</computeroutput>, which can be
transformed into human-readable form with the command line tool
<computeroutput>gprof</computeroutput>. A disadvantage here is
the need to recompile everything, and also the need to statically link the
executable.</para>
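<para>As a hedged illustration of this workflow (compiler flags and
file names here are generic examples, not taken from this manual):</para>
<screen>
gcc -pg -g -o myprog myprog.c    # build with gprof instrumentation
./myprog                         # run; writes gmon.out in the current directory
gprof myprog gmon.out &gt; profile.txt
</screen>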
<para>Another profiling tool is <command>Cachegrind</command>, part
of <ulink url="&vg-url;">Valgrind</ulink>. It uses the processor
emulation of Valgrind to run the executable, and catches all memory
accesses, which are used to drive a cache simulator.
The program does not need to be
recompiled, it can use shared libraries and plugins, and the profile
measurement doesn't influence the memory access behaviour.
The trace includes
the number of instruction/data memory accesses and 1st/2nd level
cache misses, and relates it to source lines and functions of the
run program. A disadvantage is the slowdown involved in the
processor emulation, around 50 times slower.</para>
<para>Cachegrind can only deliver a flat profile. It does not store
call relationships among the functions of an application. Thus,
inclusive costs, i.e. costs of a function including the cost of all
functions called from there, cannot be calculated. Callgrind extends
Cachegrind by recording call relationships and the exact event counts
spent while doing a call.</para>
<para>Because Callgrind (and Cachegrind) is based on simulation, the
slowdown due to processing the synthetic runtime events does not
influence the results. See <xref linkend="cl-manual.usage"/> for more
details on the possibilities.</para>

</sect2>

</sect1>
<sect1 id="cl-manual.usage" xreflabel="Usage">
<title>Usage</title>

<sect2 id="cl-manual.basics" xreflabel="Basics">
<title>Basics</title>
<para>Callgrind's ability to detect function calls and returns depends
on the instruction set of the platform it is run on. It works best
on x86 and amd64, and unfortunately currently does not work so well
on PowerPC code. This is because there are no explicit call or return
instructions in the PowerPC instruction set, so Callgrind has to rely
on heuristics to detect calls and returns.</para>
<para>As with Cachegrind, you probably want to compile with debugging info
(the <option>-g</option> flag) but with optimization turned on.</para>
<para>To start a profile run for a program, execute:
<screen>valgrind --tool=callgrind [callgrind options] your-program [program options]</screen>
</para>
<para>While the simulation is running, you can observe execution with
<screen>callgrind_control -b</screen>
This will print out the current backtrace. To annotate the backtrace with
event counts, run
<screen>callgrind_control -e -b</screen>
</para>
<para>After program termination, a profile data file named
<computeroutput>callgrind.out.&lt;pid&gt;</computeroutput>
is generated, where <emphasis>pid</emphasis> is the process ID
of the program being profiled.
The data file contains information about the calls made in the
program among the functions executed, together with events of type
<command>Instruction Read Accesses</command> (Ir).</para>
<para>To generate a function-by-function summary from the profile
data file, use
<screen>callgrind_annotate [options] callgrind.out.&lt;pid&gt;</screen>
This summary is similar to the output you get from a Cachegrind
run with <computeroutput>cg_annotate</computeroutput>: the list
of functions is ordered by exclusive cost of functions, which also
are the ones that are shown.
Important for the additional features of Callgrind are
the following two options:</para>
<para><option>--inclusive=yes</option>: Instead of using
the exclusive cost of functions as the sorting order, use and show
the inclusive cost.</para>
<para><option>--tree=both</option>: Interleave information on the
callers and the callees of each function into the
top level list of functions. In these lines, which represent executed
calls, the cost gives the number of events spent in the call.
Indented, above each function, there is the list of callers,
and below, the list of callees. The sum of events in calls to
a given function (caller lines), as well as the sum of events in
calls from the function (callee lines) together with the self
cost, gives the total inclusive cost of the function.</para>
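<para>For example, both options can be combined in a single
invocation (the <emphasis>pid</emphasis> placeholder stands for an
actual process ID):</para>
<screen>callgrind_annotate --inclusive=yes --tree=both callgrind.out.&lt;pid&gt;</screen>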
<para>Use <option>--auto=yes</option> to get annotated source code
for all relevant functions for which the source can be found. In
addition to source annotation as produced by
<computeroutput>cg_annotate</computeroutput>, you will see the
annotated call sites with call counts. For all other options,
consult the (Cachegrind) documentation for
<computeroutput>cg_annotate</computeroutput>.</para>
<para>For a better call graph browsing experience, it is highly recommended
to use <ulink url="&cl-gui;">KCachegrind</ulink>. If your program
has a significant fraction of its cost in <emphasis>cycles</emphasis> (sets
of functions calling each other in a recursive manner), you have to
use KCachegrind, as <computeroutput>callgrind_annotate</computeroutput>
currently does not do any cycle detection, which is important to get correct
results in this case.</para>
<para>If you are additionally interested in measuring the
cache behaviour of your
program, use Callgrind with the option
<option><xref linkend="opt.simulate-cache"/>=yes</option>.
This will further slow down the run by approximately a factor of 2.</para>
<para>If the program section you want to profile is somewhere in the
middle of the run, it is beneficial to
<emphasis>fast forward</emphasis> to this section without any
profiling, and then switch on profiling. This is achieved by using
the command line option
<option><xref linkend="opt.instr-atstart"/>=no</option>
and running, in a shell,
<computeroutput>callgrind_control -i on</computeroutput> just before the
interesting code section is executed. To exactly specify
the code position where profiling should start, use the client request
<computeroutput>CALLGRIND_START_INSTRUMENTATION</computeroutput>.</para>
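<para>As a sketch (the functions <computeroutput>setup</computeroutput>
and <computeroutput>compute</computeroutput> are hypothetical; the macros
come from the <computeroutput>valgrind/callgrind.h</computeroutput> header
shipped with Valgrind), the client requests can bracket the interesting
section directly in the source:</para>
<programlisting>
#include &lt;valgrind/callgrind.h&gt;

int main(void)
{
    setup();                          /* runs without instrumentation */
    CALLGRIND_START_INSTRUMENTATION;  /* profiling starts here        */
    compute();
    CALLGRIND_STOP_INSTRUMENTATION;   /* profiling stops again        */
    return 0;
}
</programlisting>
<para>The client request macros expand to cheap no-ops when the program
runs outside of Valgrind, so the same binary can still be used
normally.</para>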
<para>If you want to be able to see assembly code level annotation, specify
<option><xref linkend="opt.dump-instr"/>=yes</option>. This will produce
profile data at instruction granularity. Note that the resulting profile
data can only be viewed with KCachegrind. For assembly annotation, it also is
interesting to see more details of the control flow inside of functions,
i.e. (conditional) jumps. This information is collected by further specifying
<option><xref linkend="opt.collect-jumps"/>=yes</option>.</para>
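<para>A possible combined invocation for instruction-level profiling
(<computeroutput>your-program</computeroutput> is a placeholder):</para>
<screen>valgrind --tool=callgrind --dump-instr=yes --collect-jumps=yes your-program</screen>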
</sect2>

</sect1>

<sect1 id="cl-manual.advanced" xreflabel="Advanced Usage">
<title>Advanced Usage</title>

<sect2 id="cl-manual.dumps"
       xreflabel="Multiple dumps from one program run">
<title>Multiple profiling dumps from one program run</title>
<para>Sometimes you are not interested in characteristics of a full
program run, but only of a small part of it, for example execution of one
algorithm. If there are multiple algorithms, or one algorithm
running with different input data, it may even be useful to get different
profile information for different parts of a single program run.</para>
<para>Profile data files have names of the form
<computeroutput>callgrind.out.&lt;pid&gt;</computeroutput>.</para>

</sect2>
<sect2 id="cl-manual.cycles" xreflabel="Avoiding cycles">
<title>Avoiding cycles</title>
<para>Informally speaking, a cycle is a group of functions which
call each other in a recursive way. For example, if A calls B,
B calls C, and C calls A, the three functions A, B, and C form one
cycle.</para>
<para>Formally speaking, a cycle is a nonempty set S of functions,
such that for every pair of functions F and G in S, it is possible
to call from F to G (possibly via intermediate functions) and also
from G to F. Furthermore, S must be maximal; that is, it must be the
largest set of functions satisfying this property. For example, if
a third function H is called from inside S and calls back into S,
then H is also part of the cycle and should be included in S.</para>
<para>Recursion is quite common in programs, and therefore, cycles
sometimes appear in the call graph output of Callgrind. However,
the title of this chapter should raise two questions: What is bad
about cycles which makes you want to avoid them? And: How can
cycles be avoided without changing program code?</para>
<para>Cycles are not bad in themselves, but they tend to make performance
analysis of your code harder. This is because inclusive costs
for calls inside of a cycle are meaningless. The definition of
inclusive cost, i.e. self cost of a function plus inclusive cost
of its callees, needs a topological order among functions. For
cycles, this does not hold true: callees of a function in a cycle include
the function itself. If a call chain goes around a cycle multiple
times, there is no way to partition the one big event sum over the
individual rounds. Therefore, KCachegrind does cycle detection
and skips visualization of any inclusive cost for calls inside
of cycles. Further, all functions in a cycle are collapsed into artificial
functions with names like <computeroutput>Cycle 1</computeroutput>.</para>
<para>Now, when a program exposes really big cycles (as is
true for some GUI code, or in general code using an event or callback based
programming style), you lose the nice property of being able to pinpoint
the bottlenecks by following call chains from
<computeroutput>main()</computeroutput>, guided via
inclusive cost. In addition, KCachegrind loses its ability to show
interesting parts of the call graph, as it uses inclusive costs to
cut off uninteresting areas.</para>
<para>Despite the meaninglessness of inclusive costs in cycles, the big
drawback for visualization motivates the possibility to temporarily
switch off cycle detection in KCachegrind, even though this can lead to
misleading visualization. However, cycles often appear because of
an unlucky superposition of independent call chains in a way that
makes the profile result see a cycle. Neglecting uninteresting
calls with very small measured inclusive cost would break these
cycles. In such cases, incorrect handling of cycles by not detecting
them still gives meaningful profiling visualization.</para>
<para>It has to be noted that currently, <command>callgrind_annotate</command>
does not do any cycle detection at all. For program executions with function
recursion, it can, for example, print nonsense inclusive costs way above
100%.</para>
<para>After describing why cycles are bad for profiling, it is worth
talking about cycle avoidance. The key insight here is that symbols in
the profile data do not have to exactly match the symbols found in the
program. Instead, the symbol name could encode additional information
from the current execution context, such as the recursion level of the
current function, or even some part of the call chain leading to the
function. While encoding additional information into symbols is
quite capable of avoiding cycles, it has to be used carefully to not cause
symbol explosion. The latter imposes large memory requirements for Callgrind,
with possible out-of-memory conditions, and big profile data files.</para>
<para>A further possibility to avoid cycles in Callgrind's profile data
output is to simply leave out given functions in the call graph. Of course, this
also skips any call information from and to an ignored function, and thus can
break a cycle. Candidates for this typically are dispatcher functions in event
driven code. The option to ignore calls to a function is
<option><xref linkend="opt.fn-skip"/>=function</option>. Aside from
possibly breaking cycles, this is used in Callgrind to skip
trampoline functions in the PLT sections
for calls to functions in shared libraries. You can see the difference
if you profile with <option><xref linkend="opt.skip-plt"/>=no</option>.
If a call is ignored, its cost events will be propagated to the
enclosing function.</para>
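<para>For instance, assuming a hypothetical dispatcher function named
<computeroutput>dispatch_event</computeroutput>, calls to it could be
ignored with:</para>
<screen>valgrind --tool=callgrind --fn-skip=dispatch_event your-program</screen>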
<para>If you have a recursive function, you can distinguish the first
10 recursion levels by specifying
<option><xref linkend="opt.separate-recs-num"/>=function</option>.
Or, for all functions, with
<option><xref linkend="opt.separate-recs"/>=10</option>, but this will
give you much bigger profile data files. In the profile data, you will see
the recursion levels of "func" as the different functions with names
"func", "func'2", "func'3" and so on.</para>
<para>If you have call chains "A &gt; B &gt; C" and "A &gt; C &gt; B"
in your program, you usually get a "false" cycle "B &lt;&gt; C". Use
<option><xref linkend="opt.separate-callers-num"/>=B</option> and
<option><xref linkend="opt.separate-callers-num"/>=C</option>,
and functions "B" and "C" will be treated as different functions
depending on the direct caller. Using the apostrophe for appending
this "context" to the function name, you get "A &gt; B'A &gt; C'B"
and "A &gt; C'A &gt; B'C", and there will be no cycle. Use
<option><xref linkend="opt.separate-callers"/>=2</option> to get a 2-caller
dependency for all functions. Note that doing this will increase
the size of profile data files.</para>
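<para>Continuing the example above, a run that separates every function
by its two most recent callers could be started as
(<computeroutput>your-program</computeroutput> is a placeholder):</para>
<screen>valgrind --tool=callgrind --separate-callers=2 your-program</screen>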