~ubuntu-branches/ubuntu/intrepid/kdesdk/intrepid-updates

In contrast, for parallel code, performance problems typically are caused when one processor is waiting for data from another. As this waiting time usually can not easily attributed, here it is better to generate timestamped event traces. KCachegrind can not visualize this kind of data.

In contrast, for parallel code, performance problems typically are caused when one processor is waiting for data from another. As this waiting time usually cannot easily attributed, here it is better to generate timestamped event traces. &kcachegrind; cannot visualize this kind of data.

</para>

<para>

<title>Profiling Methods</title>

100

101

<para>To exactly measure the time passed or record the events happening during the execution of a code region (e.g. a function), additional measurement code needs to be inserted before and after the given region. This code reads the time, or a global event count, and calculates differences. Thus, the original code has to be changed before execution. This is called instrumentation. Instrumentation can be done by the programmer itself, the compiler, or by the runtime system. As interesting regions usually are nested, the overhead of measurement always influences the measurement itself. Thus, instrumentation should be done selectively and results have to be interpreted with care. Of course, this makes performance analysis by exact measurement a very complex process.</para>

101

<para>To exactly measure the time passed or record the events happening during the execution of a code region (&eg; a function), additional measurement code needs to be inserted before and after the given region. This code reads the time, or a global event count, and calculates differences. Thus, the original code has to be changed before execution. This is called instrumentation. Instrumentation can be done by the programmer itself, the compiler, or by the runtime system. As interesting regions usually are nested, the overhead of measurement always influences the measurement itself. Thus, instrumentation should be done selectively and results have to be interpreted with care. Of course, this makes performance analysis by exact measurement a very complex process.</para>

102

103

<para>Exact measurement is possible because of hardware counters (including counters incrementing on a time tick) provided in modern processors, which are incremented whenever an event is happening. As we want to attribute events to code regions, without the counters, we would have to handle every event by incrementing a counter for the current code region ourself. Doing this in software is, of course, not possible; but, on the assumption that the event distribution over source code is similar when looking only at every n-th event instead of every event, a measurement method whose overhead is tunable has been developed: it is called Sampling. Time Based Sampling (TBS) uses a timer to regularly look at the program counter to create a histogram over the program code. Event Based Sampling (EBS) exploits the hardware counters of modern processors, and uses a mode where an interrupt handler is called on counter underflow to generate a histogram of the corresponding event distribution: in the handler, the event counter is always reinitialized to the 'n' of the sampling method. The advantage of sampling

104

is that the code does not have to be changed, but it is still a compromise: the above assumption will be more correct if n is small, but the smaller the n, the higher the overhead of the interrupt handler.</para>

105

106

<para>Another measurement method is to simulate things happening in the computer system when executing a given code, i.e. execution driven simulation. The simulation is always derived from a more or less accurate machine model; however, with very detailed machine models, giving very close approximations to reality, the simulation time can be unacceptably high in practice. The advantage of simulation is that arbitrarily complex measurement/simulation code can be inserted in a given code without perturbing results. Doing this directly before execution (called runtime instrumentation), using the original binary, is very comfortable for the user: no re-compilation is necessary. Simulation becomes usable when simulating only parts of a machine with a simple model; another advantage is that the results produced by simple models are often easier to understand: often, the problem with real hardware is that results include overlapping effects from different parts of the machine.</para>

106

<para>Another measurement method is to simulate things happening in the computer system when executing a given code, &ie; execution driven simulation. The simulation is always derived from a more or less accurate machine model; however, with very detailed machine models, giving very close approximations to reality, the simulation time can be unacceptably high in practice. The advantage of simulation is that arbitrarily complex measurement/simulation code can be inserted in a given code without perturbing results. Doing this directly before execution (called runtime instrumentation), using the original binary, is very comfortable for the user: no re-compilation is necessary. Simulation becomes usable when simulating only parts of a machine with a simple model; another advantage is that the results produced by simple models are often easier to understand: often, the problem with real hardware is that results include overlapping effects from different parts of the machine.</para>

107

</sect1>

108

109

116

which can be transformed into human-readable form with

117

<command>gprof</command>. One disadvantage is the needed re-compilation

118

step to prepare the executable, which has to be statically linked.

119

The method used here is compiler-generated instrumention - which measures call arcs happening among functions and corresponding call counts - in conjunction with TBS - which gives a histogram of time distribution over the code. Using both pieces of information, it is possible to heuristically calculate inclusive time of functions, i.e. time spent in a function together with all functions called from it.

119

The method used here is compiler-generated instrumention - which measures call arcs happening among functions and corresponding call counts - in conjunction with TBS - which gives a histogram of time distribution over the code. Using both pieces of information, it is possible to heuristically calculate inclusive time of functions, &ie; time spent in a function together with all functions called from it.

120

</para>

121

122

<para>For exact measurement of events happening, libraries exist with functions able to read out hardware performance counters. Most known here is the PerfCtr patch for Linux, and the architecture independent libraries PAPI and PCL. Still, exact measurement needs instrumentation of code, as stated above. Either one uses the libraries itself or uses automatic instrumentation systems like ADAPTOR (for FORTRAN source instrumentation) or DynaProf (code injection via DynInst).</para>

122

<para>For exact measurement of events happening, libraries exist with functions able to read out hardware performance counters. Most known here is the PerfCtr patch for &Linux;, and the architecture independent libraries PAPI and PCL. Still, exact measurement needs instrumentation of code, as stated above. Either one uses the libraries itself or uses automatic instrumentation systems like ADAPTOR (for FORTRAN source instrumentation) or DynaProf (code injection via DynInst).</para>

123

124

<para>&oprofile; is a system-wide profiling tool for Linux using Sampling.</para>

124

<para>&oprofile; is a system-wide profiling tool for &Linux; using Sampling.</para>

125

126

<para>

127

In many aspects, a comfortable way of Profiling is using

128

Cachegrind or Callgrind, which are simulators using the runtime

128

&cachegrind; or &callgrind;, which are simulators using the runtime

129

instrumentation framework &valgrind;. Because there is no need

130

to access hardware counters (often difficult with today's Linux installations), and binaries to be profiled can be left unmodified,

130

to access hardware counters (often difficult with today's &Linux; installations), and binaries to be profiled can be left unmodified,

131

it is a good alternative to other profiling tools.

132

The disadvantage of simulation - slowdown - can be reduced by doing the simulation on only the interesting program parts, and perhaps only on a few iterations of a loop. Without measurement/simulation instrumentation, Valgrind's usage only has a slowdown factor in the range of 3 to 5.

132

The disadvantage of simulation - slowdown - can be reduced by doing the simulation on only the interesting program parts, and perhaps only on a few iterations of a loop. Without measurement/simulation instrumentation, &valgrind;'s usage only has a slowdown factor in the range of 3 to 5.

133

Also, when only the call graph and call counts

134

are of interest, the cache simulator can be switched off.

135

</para>

147

the run program. By combining these miss counts, using miss latencies from typical processors, an estimation of spent time can be given.

148

</para>

149

150

<para>Callgrind is an extension of &cachegrind; that

150

<para>&callgrind; is an extension of &cachegrind; that

151

builds up the call graph of a program on-the-fly,

152

&ie; how the functions call each other and how many events happen

153

while running a function. Also, the profile data to be collected

162

163

<para>

164

Profiling tools typically produce a large amount of data. The wish

165

to easily browse down and up the call graph, together with fast switching of the sorting mode of functions and display of different event types, motivates a GUI application to accomplish this task.

165

to easily browse down and up the call graph, together with fast switching of the sorting mode of functions and display of different event types, motivates a &GUI; application to accomplish this task.

166

</para>

167

168

<para>

169

&kappname; is an visualization for profile data fulfilling these wishes. Despite being programmed first with browsing the data from &cachegrind; and &calltree; in mind, there are converters available to be able to display profile data produced by other tools. In the appendix, a description of the Cachegrind/Callgrind file format is given.

169

170

</para>

171

172

<para>

247

&oprofile; is available from

248

249

http://oprofile.sf.net</ulink>. Follow the installation instructions on the web site;

250

but, before you do, check if your distribution does not already provide it as package (like SuSE).

250

but, before you do, check if your distribution does not already provide it as package (like &SuSE;).

251

</para>

252

253

<para>

254

System-wide profiling is only permitted to the root user,

255

as all actions on the system can be observed; therefore, the following has to be done as root.

256

First, configure the profiling process, using the GUI <command>oprof_start</command> or the

256

First, configure the profiling process, using the &GUI; <command>oprof_start</command> or the

257

command-line tool opcontrol. Standard configuration should be timer mode (TBS, see introduction). To start the measurement, run

258

<command>opcontrol -s</command>. Then run the application you are interested in and, afterwards, do a <command>opcontrol -d</command>. This will write out

259

the measurement results into files under directory <filename>/var/lib/oprofile/samples/</filename>. To be able to visualize the data in &kcachegrind;, do in an empty directory:

296

</para>

297

298

<para>

299

To explore the GUI further, in addition to this manual, also have a look at the documentation section on the web site

299

To explore the &GUI; further, in addition to this manual, also have a look at the documentation section on the web site

300

301

http://kcachegrind.sf.net</ulink>.

302

Also,

325

</para>

326

327

<para>

328

The cost entities known to KCachegrind are given in the following.

328

The cost entities known to &kcachegrind; are given in the following.

329

Simple Positions:

330

331

337

</para></listitem>

338

339

Function.

340

All source lines of a given function make up the function itself. A function is specified by its name and its location in some binary object if available. The latter is needed because binary objects of a single program each can hold functions with the same name (these can be accessed e.g. with dlopen/dlsym; the runtime linker resolves functions in a given search order of binary objects used). If a profiling tool can not detect the symbol name of a function, e.g. because debug information is not available, either the address of the first executed instruction typically is used, or "???".

340

All source lines of a given function make up the function itself. A function is specified by its name and its location in some binary object if available. The latter is needed because binary objects of a single program each can hold functions with the same name (these can be accessed &eg; with dlopen/dlsym; the runtime linker resolves functions in a given search order of binary objects used). If a profiling tool cannot detect the symbol name of a function, &eg; because debug information is not available, either the address of the first executed instruction typically is used, or "???".

341

</para></listitem>

342

343

Binary Object.

349

</para></listitem>

350

351

Class.

352

Symbol names of functions typically are hierarchically ordered in name spaces, e.g. C++ namespaces, or classes of object oriented languages; thus, a class can hold functions of the class or embedded classes itself.

352

Symbol names of functions typically are hierarchically ordered in name spaces, &eg; C++ namespaces, or classes of object oriented languages; thus, a class can hold functions of the class or embedded classes itself.

353

</para></listitem>

354

355

Profile Part.

404

<title>Visualization State</title>

405

406

<para>

407

The Visualization state of a KCachegrind window includes:

407

The Visualization state of a &kcachegrind; window includes:

408

409

410

the primary and secondary event type chosen for display,

416

the profile parts whose costs are to be included in visualization,

417

</para></listitem>

418

419

an active cost entity (e.g. a function selected from the function profile dockable),

419

an active cost entity (&eg; a function selected from the function profile dockable),

420

</para></listitem>

421

422

a selected cost entity.

426

This state influences visualizations.

427

</para>

428

<para>

429

Visualizations always are shown for one, the active, cost entity. When a given visualization is not appropriate for a cost entity, it is disabled (e.g. when selecting an ELF object in the group list by double-clicking, source annotation for an ELF object make no sense).

429

Visualizations always are shown for one, the active, cost entity. When a given visualization is not appropriate for a cost entity, it is disabled (&eg; when selecting an &ELF; object in the group list by double-clicking, source annotation for an &ELF; object make no sense).

430

</para>

431

<para>

432

For example, for an active function, the callee list shows all the functions called from the active one: one can select one of these functions without making it active; also, if the call-graph is shown nearside, it will automatically select the same function.

435

</sect1>

436

437

438

<title>Parts of the GUI</title>

438

<title>Parts of the &GUI;</title>

439

440

<sect2>

441

<title>Sidedocks</title>

442

<para>

443

Sidedocks (Dockables) are side windows which can be placed at any border of an KCachegrind window. They always contain a list of cost entities sorted in some manner.

443

Sidedocks (Dockables) are side windows which can be placed at any border of an &kcachegrind; window. They always contain a list of cost entities sorted in some manner.

444

445

446

Function Profile.

459

<sect2>

460

<title>Visualization Area</title>

461

<para>

462

The visualization area, typically the right part of a KCachegrind main window, is made up of one (default) or more Tab Views, either lined up horizontally or vertically. Each tab view holds different visualization views of only one cost entity at a time. The name of this entity is given at the top of the tab view. If there are multiple tab views, only one is active. The entity name in the active tab view is shown in bold and determines the active cost entity of the KCachegrind window.

462

The visualization area, typically the right part of a &kcachegrind; main window, is made up of one (default) or more Tab Views, either lined up horizontally or vertically. Each tab view holds different visualization views of only one cost entity at a time. The name of this entity is given at the top of the tab view. If there are multiple tab views, only one is active. The entity name in the active tab view is shown in bold and determines the active cost entity of the &kcachegrind; window.

463

</para>

464

</sect2>

465

487

<sect2>

488

<title>Layouts</title>

489

<para>

490

The layout of all the tab views of a window can be saved (see menu item View/Layout). After duplicating the current layout (Ctrl+Plus or menu) and changing some sizes or moving a visualization view to another area of an tab view, you can quickly switch between the old and the new layout via Ctrl+Left/Right. The set of layouts will be stored between KCachegrind sessions of the same profiled command. You can make the current set of layouts as the default one for new KCachegrind sessions, or restore to the default layout set.

490

The layout of all the tab views of a window can be saved (see menu item View/Layout). After duplicating the current layout (Ctrl+Plus or menu) and changing some sizes or moving a visualization view to another area of an tab view, you can quickly switch between the old and the new layout via Ctrl+Left/Right. The set of layouts will be stored between &kcachegrind; sessions of the same profiled command. You can make the current set of layouts as the default one for new &kcachegrind; sessions, or restore to the default layout set.

491

</para>

492

</sect2>

493

</sect1>

501

The flat profile contains a group list and a function list. The group list contains all groups where cost is spent in, depending on the chosen group type. The group list is hidden when grouping is switched off.

502

</para>

503

<para>

504

The function list contains the functions of the selected group (or all functions if grouping is switched off), ordered by some column, e.g. inclusive or self costs spent therein. There is a maximal number of functions shown in the list, which is configurable in Settings/Configure KCachegrind.

504

The function list contains the functions of the selected group (or all functions if grouping is switched off), ordered by some column, &eg; inclusive or self costs spent therein. There is a maximal number of functions shown in the list, which is configurable in Settings/Configure &kcachegrind;.

505

</para>

506

</sect2>

507

508

<sect2>

509

<title>Parts Overview</title>

510

<para>

511

In a profile run, multiple profile data files can be produced, which can be loaded together into KCachegrind. The Parts Overview dockable shows these, horizontally ordered according creation time; the rectangle sizes are proportional to the cost spent in the parts. You can select one or several parts to constrain the costs shown in the other KCachegrind views to these parts only.

511

In a profile run, multiple profile data files can be produced, which can be loaded together into &kcachegrind;. The Parts Overview dockable shows these, horizontally ordered according creation time; the rectangle sizes are proportional to the cost spent in the parts. You can select one or several parts to constrain the costs shown in the other &kcachegrind; views to these parts only.

512

</para>

513

<para>

514

The parts are further subdivided: there is a partitioning and an inclusive cost split mode:

515

516

517

Partitioning: You see the partitioning into groups for a profile data part, according to the group type selected. For example, if ELF object groups are selected, you see colored rectangles for each used ELF object (shared library or executable), sized according to the cost spent therein.

517

Partitioning: You see the partitioning into groups for a profile data part, according to the group type selected. For example, if &ELF; object groups are selected, you see colored rectangles for each used &ELF; object (shared library or executable), sized according to the cost spent therein.

518

</para></listitem>

519

520

Inclusive Cost Split: A rectangle showing the inclusive cost of the current active function in the part is shown. This, again, is split up to show inclusive costs of its callees.

543

This list shows all cost types available and the corresponding self and inclusive cost of the current active function for that event type.

544

</para>

545

<para>

546

By choosing an event type from the list, you change the type of costs shown all over KCachegrind to be the selected one.

546

By choosing an event type from the list, you change the type of costs shown all over &kcachegrind; to be the selected one.

547

</para>

548

</sect2>

549

591

<title>Call Graph</title>

592

<para>

593

This view shows the call graph around the active function.

594

The shown cost is only the cost which is spent while the active function was actually running; i.e. the cost shown for main() - if it's visible - should be the same as the cost of the active function, as that is the part of inclusive cost of main() spent while the active function was running.

594

The shown cost is only the cost which is spent while the active function was actually running; &ie; the cost shown for main() - if it's visible - should be the same as the cost of the active function, as that is the part of inclusive cost of main() spent while the active function was running.

595

</para>

596

<para>

597

For cycles, blue call arrows indicate that this is an artificial call added for correct drawing which actually never happened.

620

621

622

<title>The main &kcachegrind; window</title>

623

624

623

625

624

<sect2>

626

625

797

796

798

797

799

798

800

<para>The toolbar/menubar of my KCachegrind looks so spartanic. Is this normal?</para>

799

<para>The toolbar/menubar of my &kcachegrind; looks so spartanic. Is this normal?</para>

801

800

</question>

802

801

803

802

<para>

804

Obviously KCachegrind is wrongly installed on your system. It is recommended to compile it with the installation prefix to be your system wide KDE base directory like <command>configure --prefix=/opt/kde3; make install</command>.

805

If you choose another directory like $HOME/kde, you should set the environment variable KDEDIR to this directory before running KCachegrind.

803

Obviously &kcachegrind; is wrongly installed on your system. It is recommended to compile it with the installation prefix to be your system wide &kde; base directory like <command>configure --prefix=/opt/kde3; make install</command>.

804

If you choose another directory like $HOME/kde, you should set the environment variable KDEDIR to this directory before running &kcachegrind;.

806

805

</para>

807

806

</answer>

808

807

</qandaentry>

852

851

Profile Project: A configuration for profile experiments used for one program which has to be profiled, perhaps in multiple versions. Comparisons of profile data typically only makes sense between profile data produced in experiments of one profile project.

853

852

</para></listitem>

854

853

855

Cost Entity: An abstract item related to source code to which event counts can be attributed. Dimensions for cost entities are code location (e.g. source line, function), data location (e.g. accessed data type, data object), execution location (e.g. thread, process), and tuples or triples of the aforementioned positions (e.g. calls, object access from statement, evicted data from cache).

854

Cost Entity: An abstract item related to source code to which event counts can be attributed. Dimensions for cost entities are code location (&eg; source line, function), data location (&eg; accessed data type, data object), execution location (&eg; thread, process), and tuples or triples of the aforementioned positions (&eg; calls, object access from statement, evicted data from cache).

856

855

</para></listitem>

857

856

858

857

Event Type: The kind of event of which costs can be attributed to a cost entity. There exist real event types and inherited event types.

881

880

<para>

882

881

Thanks to Julian Seward for his excellent &valgrind;, and Nicholas

883

882

Nethercote for the &cachegrind; addition. Without these programs,

884

<application>KCachegrind</application> would not exist. Some ideas

883

&kcachegrind; would not exist. Some ideas

885

884

for this &GUI; were from them, too.

886

885

</para>

887

886

<para>

889

888

users.

890

889

</para>

891

890

891

892

&underFDL;

893

894

</chapter>

Older »