84
84
runtime characteristic like time
85
85
numbers spent in functions and code lines usually is enough.
86
86
This is called Profiling. The program is run under control of a profiling tool, which gives the summary of an execution run at the end.
87
In contrast, for parallel code, performance problems typically are caused when one processor is waiting for data from another. As this waiting time usually can not easily attributed, here it is better to generate timestamped event traces. KCachegrind can not visualize this kind of data.
87
In contrast, for parallel code, performance problems typically are caused when one processor is waiting for data from another. As this waiting time usually cannot easily attributed, here it is better to generate timestamped event traces. &kcachegrind; cannot visualize this kind of data.
98
98
<sect1 id="introduction-methods">
99
99
<title>Profiling Methods</title>
101
<para>To exactly measure the time passed or record the events happening during the execution of a code region (e.g. a function), additional measurement code needs to be inserted before and after the given region. This code reads the time, or a global event count, and calculates differences. Thus, the original code has to be changed before execution. This is called instrumentation. Instrumentation can be done by the programmer itself, the compiler, or by the runtime system. As interesting regions usually are nested, the overhead of measurement always influences the measurement itself. Thus, instrumentation should be done selectively and results have to be interpreted with care. Of course, this makes performance analysis by exact measurement a very complex process.</para>
101
<para>To exactly measure the time passed or record the events happening during the execution of a code region (⪚ a function), additional measurement code needs to be inserted before and after the given region. This code reads the time, or a global event count, and calculates differences. Thus, the original code has to be changed before execution. This is called instrumentation. Instrumentation can be done by the programmer itself, the compiler, or by the runtime system. As interesting regions usually are nested, the overhead of measurement always influences the measurement itself. Thus, instrumentation should be done selectively and results have to be interpreted with care. Of course, this makes performance analysis by exact measurement a very complex process.</para>
103
103
<para>Exact measurement is possible because of hardware counters (including counters incrementing on a time tick) provided in modern processors, which are incremented whenever an event is happening. As we want to attribute events to code regions, without the counters, we would have to handle every event by incrementing a counter for the current code region ourself. Doing this in software is, of course, not possible; but, on the assumption that the event distribution over source code is similar when looking only at every n-th event instead of every event, a measurement method whose overhead is tunable has been developed: it is called Sampling. Time Based Sampling (TBS) uses a timer to regularly look at the program counter to create a histogram over the program code. Event Based Sampling (EBS) exploits the hardware counters of modern processors, and uses a mode where an interrupt handler is called on counter underflow to generate a histogram of the corresponding event distribution: in the handler, the event counter is always reinitialized to the 'n' of the sampling method. The advantage of sampling
104
104
is that the code does not have to be changed, but it is still a compromise: the above assumption will be more correct if n is small, but the smaller the n, the higher the overhead of the interrupt handler.</para>
106
<para>Another measurement method is to simulate things happening in the computer system when executing a given code, i.e. execution driven simulation. The simulation is always derived from a more or less accurate machine model; however, with very detailed machine models, giving very close approximations to reality, the simulation time can be unacceptably high in practice. The advantage of simulation is that arbitrarily complex measurement/simulation code can be inserted in a given code without perturbing results. Doing this directly before execution (called runtime instrumentation), using the original binary, is very comfortable for the user: no re-compilation is necessary. Simulation becomes usable when simulating only parts of a machine with a simple model; another advantage is that the results produced by simple models are often easier to understand: often, the problem with real hardware is that results include overlapping effects from different parts of the machine.</para>
106
<para>Another measurement method is to simulate things happening in the computer system when executing a given code, &ie; execution driven simulation. The simulation is always derived from a more or less accurate machine model; however, with very detailed machine models, giving very close approximations to reality, the simulation time can be unacceptably high in practice. The advantage of simulation is that arbitrarily complex measurement/simulation code can be inserted in a given code without perturbing results. Doing this directly before execution (called runtime instrumentation), using the original binary, is very comfortable for the user: no re-compilation is necessary. Simulation becomes usable when simulating only parts of a machine with a simple model; another advantage is that the results produced by simple models are often easier to understand: often, the problem with real hardware is that results include overlapping effects from different parts of the machine.</para>
109
109
<sect1 id="introduction-tools">
116
116
which can be transformed into human-readable form with
117
117
<command>gprof</command>. One disadvantage is the needed re-compilation
118
118
step to prepare the executable, which has to be statically linked.
119
The method used here is compiler-generated instrumention - which measures call arcs happening among functions and corresponding call counts - in conjunction with TBS - which gives a histogram of time distribution over the code. Using both pieces of information, it is possible to heuristically calculate inclusive time of functions, i.e. time spent in a function together with all functions called from it.
119
The method used here is compiler-generated instrumention - which measures call arcs happening among functions and corresponding call counts - in conjunction with TBS - which gives a histogram of time distribution over the code. Using both pieces of information, it is possible to heuristically calculate inclusive time of functions, &ie; time spent in a function together with all functions called from it.
122
<para>For exact measurement of events happening, libraries exist with functions able to read out hardware performance counters. Most known here is the PerfCtr patch for Linux, and the architecture independent libraries PAPI and PCL. Still, exact measurement needs instrumentation of code, as stated above. Either one uses the libraries itself or uses automatic instrumentation systems like ADAPTOR (for FORTRAN source instrumentation) or DynaProf (code injection via DynInst).</para>
122
<para>For exact measurement of events happening, libraries exist with functions able to read out hardware performance counters. Most known here is the PerfCtr patch for &Linux;, and the architecture independent libraries PAPI and PCL. Still, exact measurement needs instrumentation of code, as stated above. Either one uses the libraries itself or uses automatic instrumentation systems like ADAPTOR (for FORTRAN source instrumentation) or DynaProf (code injection via DynInst).</para>
124
<para>&oprofile; is a system-wide profiling tool for Linux using Sampling.</para>
124
<para>&oprofile; is a system-wide profiling tool for &Linux; using Sampling.</para>
127
127
In many aspects, a comfortable way of Profiling is using
128
Cachegrind or Callgrind, which are simulators using the runtime
128
&cachegrind; or &callgrind;, which are simulators using the runtime
129
129
instrumentation framework &valgrind;. Because there is no need
130
to access hardware counters (often difficult with today's Linux installations), and binaries to be profiled can be left unmodified,
130
to access hardware counters (often difficult with today's &Linux; installations), and binaries to be profiled can be left unmodified,
131
131
it is a good alternative to other profiling tools.
132
The disadvantage of simulation - slowdown - can be reduced by doing the simulation on only the interesting program parts, and perhaps only on a few iterations of a loop. Without measurement/simulation instrumentation, Valgrind's usage only has a slowdown factor in the range of 3 to 5.
132
The disadvantage of simulation - slowdown - can be reduced by doing the simulation on only the interesting program parts, and perhaps only on a few iterations of a loop. Without measurement/simulation instrumentation, &valgrind;'s usage only has a slowdown factor in the range of 3 to 5.
133
133
Also, when only the call graph and call counts
134
134
are of interest, the cache simulator can be switched off.
147
147
the run program. By combining these miss counts, using miss latencies from typical processors, an estimation of spent time can be given.
150
<para>Callgrind is an extension of &cachegrind; that
150
<para>&callgrind; is an extension of &cachegrind; that
151
151
builds up the call graph of a program on-the-fly,
152
152
&ie; how the functions call each other and how many events happen
153
153
while running a function. Also, the profile data to be collected
164
164
Profiling tools typically produce a large amount of data. The wish
165
to easily browse down and up the call graph, together with fast switching of the sorting mode of functions and display of different event types, motivates a GUI application to accomplish this task.
165
to easily browse down and up the call graph, together with fast switching of the sorting mode of functions and display of different event types, motivates a &GUI; application to accomplish this task.
169
&kappname; is an visualization for profile data fulfilling these wishes. Despite being programmed first with browsing the data from &cachegrind; and &calltree; in mind, there are converters available to be able to display profile data produced by other tools. In the appendix, a description of the Cachegrind/Callgrind file format is given.
169
&kappname; is an visualization for profile data fulfilling these wishes. Despite being programmed first with browsing the data from &cachegrind; and &calltree; in mind, there are converters available to be able to display profile data produced by other tools. In the appendix, a description of the &cachegrind;/&callgrind; file format is given.
247
247
&oprofile; is available from
248
248
<ulink url="http://oprofile.sf.net">
249
249
http://oprofile.sf.net</ulink>. Follow the installation instructions on the web site;
250
but, before you do, check if your distribution does not already provide it as package (like SuSE).
250
but, before you do, check if your distribution does not already provide it as package (like &SuSE;).
254
254
System-wide profiling is only permitted to the root user,
255
255
as all actions on the system can be observed; therefore, the following has to be done as root.
256
First, configure the profiling process, using the GUI <command>oprof_start</command> or the
256
First, configure the profiling process, using the &GUI; <command>oprof_start</command> or the
257
257
command-line tool opcontrol. Standard configuration should be timer mode (TBS, see introduction). To start the measurement, run
258
258
<command>opcontrol -s</command>. Then run the application you are interested in and, afterwards, do a <command>opcontrol -d</command>. This will write out
259
259
the measurement results into files under directory <filename>/var/lib/oprofile/samples/</filename>. To be able to visualize the data in &kcachegrind;, do in an empty directory:
299
To explore the GUI further, in addition to this manual, also have a look at the documentation section on the web site
299
To explore the &GUI; further, in addition to this manual, also have a look at the documentation section on the web site
300
300
<ulink url="http://kcachegrind.sf.net">
301
301
http://kcachegrind.sf.net</ulink>.
337
337
</para></listitem>
340
All source lines of a given function make up the function itself. A function is specified by its name and its location in some binary object if available. The latter is needed because binary objects of a single program each can hold functions with the same name (these can be accessed e.g. with dlopen/dlsym; the runtime linker resolves functions in a given search order of binary objects used). If a profiling tool can not detect the symbol name of a function, e.g. because debug information is not available, either the address of the first executed instruction typically is used, or "???".
340
All source lines of a given function make up the function itself. A function is specified by its name and its location in some binary object if available. The latter is needed because binary objects of a single program each can hold functions with the same name (these can be accessed ⪚ with dlopen/dlsym; the runtime linker resolves functions in a given search order of binary objects used). If a profiling tool cannot detect the symbol name of a function, ⪚ because debug information is not available, either the address of the first executed instruction typically is used, or "???".
341
341
</para></listitem>
349
349
</para></listitem>
352
Symbol names of functions typically are hierarchically ordered in name spaces, e.g. C++ namespaces, or classes of object oriented languages; thus, a class can hold functions of the class or embedded classes itself.
352
Symbol names of functions typically are hierarchically ordered in name spaces, ⪚ C++ namespaces, or classes of object oriented languages; thus, a class can hold functions of the class or embedded classes itself.
353
353
</para></listitem>
404
404
<title>Visualization State</title>
407
The Visualization state of a KCachegrind window includes:
407
The Visualization state of a &kcachegrind; window includes:
410
410
the primary and secondary event type chosen for display,
416
416
the profile parts whose costs are to be included in visualization,
417
417
</para></listitem>
419
an active cost entity (e.g. a function selected from the function profile dockable),
419
an active cost entity (⪚ a function selected from the function profile dockable),
420
420
</para></listitem>
422
422
a selected cost entity.
426
426
This state influences visualizations.
429
Visualizations always are shown for one, the active, cost entity. When a given visualization is not appropriate for a cost entity, it is disabled (e.g. when selecting an ELF object in the group list by double-clicking, source annotation for an ELF object make no sense).
429
Visualizations always are shown for one, the active, cost entity. When a given visualization is not appropriate for a cost entity, it is disabled (⪚ when selecting an &ELF; object in the group list by double-clicking, source annotation for an &ELF; object make no sense).
432
432
For example, for an active function, the callee list shows all the functions called from the active one: one can select one of these functions without making it active; also, if the call-graph is shown nearside, it will automatically select the same function.
437
437
<sect1 id="concepts-guiparts">
438
<title>Parts of the GUI</title>
438
<title>Parts of the &GUI;</title>
441
441
<title>Sidedocks</title>
443
Sidedocks (Dockables) are side windows which can be placed at any border of an KCachegrind window. They always contain a list of cost entities sorted in some manner.
443
Sidedocks (Dockables) are side windows which can be placed at any border of an &kcachegrind; window. They always contain a list of cost entities sorted in some manner.
446
446
Function Profile.
460
460
<title>Visualization Area</title>
462
The visualization area, typically the right part of a KCachegrind main window, is made up of one (default) or more Tab Views, either lined up horizontally or vertically. Each tab view holds different visualization views of only one cost entity at a time. The name of this entity is given at the top of the tab view. If there are multiple tab views, only one is active. The entity name in the active tab view is shown in bold and determines the active cost entity of the KCachegrind window.
462
The visualization area, typically the right part of a &kcachegrind; main window, is made up of one (default) or more Tab Views, either lined up horizontally or vertically. Each tab view holds different visualization views of only one cost entity at a time. The name of this entity is given at the top of the tab view. If there are multiple tab views, only one is active. The entity name in the active tab view is shown in bold and determines the active cost entity of the &kcachegrind; window.
488
488
<title>Layouts</title>
490
The layout of all the tab views of a window can be saved (see menu item View/Layout). After duplicating the current layout (Ctrl+Plus or menu) and changing some sizes or moving a visualization view to another area of an tab view, you can quickly switch between the old and the new layout via Ctrl+Left/Right. The set of layouts will be stored between KCachegrind sessions of the same profiled command. You can make the current set of layouts as the default one for new KCachegrind sessions, or restore to the default layout set.
490
The layout of all the tab views of a window can be saved (see menu item View/Layout). After duplicating the current layout (Ctrl+Plus or menu) and changing some sizes or moving a visualization view to another area of an tab view, you can quickly switch between the old and the new layout via Ctrl+Left/Right. The set of layouts will be stored between &kcachegrind; sessions of the same profiled command. You can make the current set of layouts as the default one for new &kcachegrind; sessions, or restore to the default layout set.
501
501
The flat profile contains a group list and a function list. The group list contains all groups where cost is spent in, depending on the chosen group type. The group list is hidden when grouping is switched off.
504
The function list contains the functions of the selected group (or all functions if grouping is switched off), ordered by some column, e.g. inclusive or self costs spent therein. There is a maximal number of functions shown in the list, which is configurable in Settings/Configure KCachegrind.
504
The function list contains the functions of the selected group (or all functions if grouping is switched off), ordered by some column, ⪚ inclusive or self costs spent therein. There is a maximal number of functions shown in the list, which is configurable in Settings/Configure &kcachegrind;.
509
509
<title>Parts Overview</title>
511
In a profile run, multiple profile data files can be produced, which can be loaded together into KCachegrind. The Parts Overview dockable shows these, horizontally ordered according creation time; the rectangle sizes are proportional to the cost spent in the parts. You can select one or several parts to constrain the costs shown in the other KCachegrind views to these parts only.
511
In a profile run, multiple profile data files can be produced, which can be loaded together into &kcachegrind;. The Parts Overview dockable shows these, horizontally ordered according creation time; the rectangle sizes are proportional to the cost spent in the parts. You can select one or several parts to constrain the costs shown in the other &kcachegrind; views to these parts only.
514
514
The parts are further subdivided: there is a partitioning and an inclusive cost split mode:
517
Partitioning: You see the partitioning into groups for a profile data part, according to the group type selected. For example, if ELF object groups are selected, you see colored rectangles for each used ELF object (shared library or executable), sized according to the cost spent therein.
517
Partitioning: You see the partitioning into groups for a profile data part, according to the group type selected. For example, if &ELF; object groups are selected, you see colored rectangles for each used &ELF; object (shared library or executable), sized according to the cost spent therein.
518
518
</para></listitem>
520
520
Inclusive Cost Split: A rectangle showing the inclusive cost of the current active function in the part is shown. This, again, is split up to show inclusive costs of its callees.
543
543
This list shows all cost types available and the corresponding self and inclusive cost of the current active function for that event type.
546
By choosing an event type from the list, you change the type of costs shown all over KCachegrind to be the selected one.
546
By choosing an event type from the list, you change the type of costs shown all over &kcachegrind; to be the selected one.
591
591
<title>Call Graph</title>
593
593
This view shows the call graph around the active function.
594
The shown cost is only the cost which is spent while the active function was actually running; i.e. the cost shown for main() - if it's visible - should be the same as the cost of the active function, as that is the part of inclusive cost of main() spent while the active function was running.
594
The shown cost is only the cost which is spent while the active function was actually running; &ie; the cost shown for main() - if it's visible - should be the same as the cost of the active function, as that is the part of inclusive cost of main() spent while the active function was running.
597
597
For cycles, blue call arrows indicate that this is an artificial call added for correct drawing which actually never happened.
800
<para>The toolbar/menubar of my KCachegrind looks so spartanic. Is this normal?</para>
799
<para>The toolbar/menubar of my &kcachegrind; looks so spartanic. Is this normal?</para>
804
Obviously KCachegrind is wrongly installed on your system. It is recommended to compile it with the installation prefix to be your system wide KDE base directory like <command>configure --prefix=/opt/kde3; make install</command>.
805
If you choose another directory like $HOME/kde, you should set the environment variable KDEDIR to this directory before running KCachegrind.
803
Obviously &kcachegrind; is wrongly installed on your system. It is recommended to compile it with the installation prefix to be your system wide &kde; base directory like <command>configure --prefix=/opt/kde3; make install</command>.
804
If you choose another directory like $HOME/kde, you should set the environment variable KDEDIR to this directory before running &kcachegrind;.
852
851
Profile Project: A configuration for profile experiments used for one program which has to be profiled, perhaps in multiple versions. Comparisons of profile data typically only makes sense between profile data produced in experiments of one profile project.
853
852
</para></listitem>
855
Cost Entity: An abstract item related to source code to which event counts can be attributed. Dimensions for cost entities are code location (e.g. source line, function), data location (e.g. accessed data type, data object), execution location (e.g. thread, process), and tuples or triples of the aforementioned positions (e.g. calls, object access from statement, evicted data from cache).
854
Cost Entity: An abstract item related to source code to which event counts can be attributed. Dimensions for cost entities are code location (⪚ source line, function), data location (⪚ accessed data type, data object), execution location (⪚ thread, process), and tuples or triples of the aforementioned positions (⪚ calls, object access from statement, evicted data from cache).
856
855
</para></listitem>
858
857
Event Type: The kind of event of which costs can be attributed to a cost entity. There exist real event types and inherited event types.
882
881
Thanks to Julian Seward for his excellent &valgrind;, and Nicholas
883
882
Nethercote for the &cachegrind; addition. Without these programs,
884
<application>KCachegrind</application> would not exist. Some ideas
883
&kcachegrind; would not exist. Some ideas
885
884
for this &GUI; were from them, too.