1
<html xmlns:cf="http://docbook.sourceforge.net/xmlns/chunkfast/1.0">
3
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
4
<title>1.�The Design and Implementation of Valgrind</title>
5
<link rel="stylesheet" href="vg_basic.css" type="text/css">
6
<meta name="generator" content="DocBook XSL Stylesheets V1.69.0">
7
<link rel="start" href="index.html" title="Valgrind Documentation">
8
<link rel="up" href="tech-docs.html" title="Valgrind Technical Documentation">
9
<link rel="prev" href="tech-docs.html" title="Valgrind Technical Documentation">
10
<link rel="next" href="cg-tech-docs.html" title="2.�How Cachegrind works">
12
<body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF">
13
<div><table class="nav" width="100%" cellspacing="3" cellpadding="3" border="0" summary="Navigation header"><tr>
14
<td width="22px" align="center" valign="middle"><a accesskey="p" href="tech-docs.html"><img src="images/prev.png" width="18" height="21" border="0" alt="Prev"></a></td>
15
<td width="25px" align="center" valign="middle"><a accesskey="u" href="tech-docs.html"><img src="images/up.png" width="21" height="18" border="0" alt="Up"></a></td>
16
<td width="31px" align="center" valign="middle"><a accesskey="h" href="index.html"><img src="images/home.png" width="27" height="20" border="0" alt="Up"></a></td>
17
<th align="center" valign="middle">Valgrind Technical Documentation</th>
18
<td width="22px" align="center" valign="middle"><a accesskey="n" href="cg-tech-docs.html"><img src="images/next.png" width="18" height="21" border="0" alt="Next"></a></td>
20
<div class="chapter" lang="en">
21
<div class="titlepage"><div>
22
<div><h2 class="title">
23
<a name="mc-tech-docs"></a>1.�The Design and Implementation of Valgrind</h2></div>
24
<div><h3 class="subtitle"><i>Detailed technical notes for hackers, maintainers and
25
the overly-curious</i></h3></div>
28
<p><b>Table of Contents</b></p>
30
<dt><span class="sect1"><a href="mc-tech-docs.html#mc-tech-docs.intro">1.1. Introduction</a></span></dt>
32
<dt><span class="sect2"><a href="mc-tech-docs.html#mc-tech-docs.history">1.1.1. History</a></span></dt>
33
<dt><span class="sect2"><a href="mc-tech-docs.html#mc-tech-docs.overview">1.1.2. Design overview</a></span></dt>
34
<dt><span class="sect2"><a href="mc-tech-docs.html#mc-tech-docs.design">1.1.3. Design decisions</a></span></dt>
35
<dt><span class="sect2"><a href="mc-tech-docs.html#mc-tech-docs.correctness">1.1.4. Correctness</a></span></dt>
36
<dt><span class="sect2"><a href="mc-tech-docs.html#mc-tech-docs.limits">1.1.5. Current limitations</a></span></dt>
38
<dt><span class="sect1"><a href="mc-tech-docs.html#mc-tech-docs.jitter">1.2. The instrumenting JITter</a></span></dt>
40
<dt><span class="sect2"><a href="mc-tech-docs.html#mc-tech-docs.storage">1.2.1. Run-time storage, and the use of host registers</a></span></dt>
41
<dt><span class="sect2"><a href="mc-tech-docs.html#mc-tech-docs.startup">1.2.2. Startup, shutdown, and system calls</a></span></dt>
42
<dt><span class="sect2"><a href="mc-tech-docs.html#mc-tech-docs.ucode">1.2.3. Introduction to UCode</a></span></dt>
43
<dt><span class="sect2"><a href="mc-tech-docs.html#mc-tech-docs.tags">1.2.4. UCode operand tags: type <code class="computeroutput">Tag</code></a></span></dt>
44
<dt><span class="sect2"><a href="mc-tech-docs.html#mc-tech-docs.uinstr">1.2.5. UCode instructions: type <code class="computeroutput">UInstr</code></a></span></dt>
45
<dt><span class="sect2"><a href="mc-tech-docs.html#mc-tech-docs.trans">1.2.6. Translation into UCode</a></span></dt>
46
<dt><span class="sect2"><a href="mc-tech-docs.html#mc-tech-docs.optim">1.2.7. UCode optimisation</a></span></dt>
47
<dt><span class="sect2"><a href="mc-tech-docs.html#mc-tech-docs.instrum">1.2.8. UCode instrumentation</a></span></dt>
48
<dt><span class="sect2"><a href="mc-tech-docs.html#mc-tech-docs.cleanup">1.2.9. UCode post-instrumentation cleanup</a></span></dt>
49
<dt><span class="sect2"><a href="mc-tech-docs.html#mc-tech-docs.transfrom">1.2.10. Translation from UCode</a></span></dt>
50
<dt><span class="sect2"><a href="mc-tech-docs.html#mc-tech-docs.dispatch">1.2.11. Top-level dispatch loop</a></span></dt>
51
<dt><span class="sect2"><a href="mc-tech-docs.html#mc-tech-docs.lazy">1.2.12. Lazy updates of the simulated program counter</a></span></dt>
52
<dt><span class="sect2"><a href="mc-tech-docs.html#mc-tech-docs.signals">1.2.13. Signals</a></span></dt>
53
<dt><span class="sect2"><a href="mc-tech-docs.html#mc-tech-docs.todo">1.2.14. To be written</a></span></dt>
55
<dt><span class="sect1"><a href="mc-tech-docs.html#mc-tech-docs.extensions">1.3. Extensions</a></span></dt>
57
<dt><span class="sect2"><a href="mc-tech-docs.html#mc-tech-docs.bugs">1.3.1. Bugs</a></span></dt>
58
<dt><span class="sect2"><a href="mc-tech-docs.html#mc-tech-docs.threads">1.3.2. Threads</a></span></dt>
59
<dt><span class="sect2"><a href="mc-tech-docs.html#mc-tech-docs.verify">1.3.3. Verification suite</a></span></dt>
60
<dt><span class="sect2"><a href="mc-tech-docs.html#mc-tech-docs.porting">1.3.4. Porting to other platforms</a></span></dt>
62
<dt><span class="sect1"><a href="mc-tech-docs.html#mc-tech-docs.easystuff">1.4. Easy stuff which ought to be done</a></span></dt>
64
<dt><span class="sect2"><a href="mc-tech-docs.html#mc-tech-docs.mmx">1.4.1. MMX Instructions</a></span></dt>
65
<dt><span class="sect2"><a href="mc-tech-docs.html#mc-tech-docs.fixstabs">1.4.2. Fix stabs-info reader</a></span></dt>
66
<dt><span class="sect2"><a href="mc-tech-docs.html#mc-tech-docs.x86instr">1.4.3. BT/BTC/BTS/BTR</a></span></dt>
67
<dt><span class="sect2"><a href="mc-tech-docs.html#mc-tech-docs.prefetch">1.4.4. Using PREFETCH Instructions</a></span></dt>
68
<dt><span class="sect2"><a href="mc-tech-docs.html#mc-tech-docs.pranges">1.4.5. User-defined Permission Ranges</a></span></dt>
72
<div class="sect1" lang="en">
73
<div class="titlepage"><div><div><h2 class="title" style="clear: both">
74
<a name="mc-tech-docs.intro"></a>1.1.�Introduction</h2></div></div></div>
75
<p>This document contains a detailed, highly-technical description of
76
the internals of Valgrind. This is not the user manual; if you are an
77
end-user of Valgrind, you do not want to read this. Conversely, if you
78
really are a hacker-type and want to know how it works, I assume that
79
you have read the user manual thoroughly.</p>
80
<p>You may need to read this document several times, and carefully.
81
Some important things, I only say once.</p>
82
<p>[Note: this document is now very old, and a lot of its contents
83
are out of date, and misleading.]</p>
84
<div class="sect2" lang="en">
85
<div class="titlepage"><div><div><h3 class="title">
86
<a name="mc-tech-docs.history"></a>1.1.1.�History</h3></div></div></div>
87
<p>Valgrind came into public view in late Feb 2002. However, it has
88
been under contemplation for a very long time, perhaps seriously for
89
about five years. Somewhat over two years ago, I started working on the
90
x86 code generator for the Glasgow Haskell Compiler
91
(http://www.haskell.org/ghc), gaining familiarity with x86 internals on
92
the way. I then did Cacheprof, gaining further x86 experience. Some
93
time around Feb 2000 I started experimenting with a user-space x86
94
interpreter for x86-Linux. This worked, but it was clear that a
95
JIT-based scheme would be necessary to give reasonable performance for
96
Valgrind. Design work for the JITter started in earnest in Oct 2000,
97
and by early 2001 I had an x86-to-x86 dynamic translator which could run
98
quite large programs. This translator was in a sense pointless, since
99
it did not do any instrumentation or checking.</p>
100
<p>Most of the rest of 2001 was taken up designing and implementing
101
the instrumentation scheme. The main difficulty, which consumed a lot
102
of effort, was to design a scheme which did not generate large numbers
103
of false uninitialised-value warnings. By late 2001 a satisfactory
104
scheme had been arrived at, and I started to test it on ever-larger
105
programs, with an eventual eye to making it work well enough so that it
106
was helpful to folks debugging the upcoming version 3 of KDE. I've used
107
KDE since before version 1.0, and wanted to Valgrind to be an indirect
108
contribution to the KDE 3 development effort. At the start of Feb 02
109
the kde-core-devel crew started using it, and gave a huge amount of
110
helpful feedback and patches in the space of three weeks. Snapshot
111
20020306 is the result.</p>
112
<p>In the best Unix tradition, or perhaps in the spirit of Fred
113
Brooks' depressing-but-completely-accurate epitaph "build one to throw
114
away; you will anyway", much of Valgrind is a second or third rendition
115
of the initial idea. The instrumentation machinery
116
(<code class="filename">vg_translate.c</code>, <code class="filename">vg_memory.c</code>)
117
and core CPU simulation (<code class="filename">vg_to_ucode.c</code>,
118
<code class="filename">vg_from_ucode.c</code>) have had three redesigns and
119
rewrites; the register allocator, low-level memory manager
120
(<code class="filename">vg_malloc2.c</code>) and symbol table reader
121
(<code class="filename">vg_symtab2.c</code>) are on the second rewrite. In a
122
sense, this document serves to record some of the knowledge gained as a
125
<div class="sect2" lang="en">
126
<div class="titlepage"><div><div><h3 class="title">
127
<a name="mc-tech-docs.overview"></a>1.1.2.�Design overview</h3></div></div></div>
128
<p>Valgrind is compiled into a Linux shared object,
129
<code class="filename">valgrind.so</code>, and also a dummy one,
130
<code class="filename">valgrinq.so</code>, of which more later. The
131
<code class="filename">valgrind</code> shell script adds
132
<code class="filename">valgrind.so</code> to the
133
<code class="computeroutput">LD_PRELOAD</code> list of extra libraries to
134
be loaded with any dynamically linked library. This is a standard
135
trick, one which I assume the
136
<code class="computeroutput">LD_PRELOAD</code> mechanism was developed to
138
<p><code class="filename">valgrind.so</code> is linked with the
139
<code class="computeroutput">-z initfirst</code> flag, which
140
requests that its initialisation code is run before that of any
141
other object in the executable image. When this happens,
142
valgrind gains control. The real CPU becomes "trapped" in
143
<code class="filename">valgrind.so</code> and the translations it
144
generates. The synthetic CPU provided by Valgrind does, however,
145
return from this initialisation function. So the normal startup
146
actions, orchestrated by the dynamic linker
147
<code class="filename">ld.so</code>, continue as usual, except on the
148
synthetic CPU, not the real one. Eventually
149
<code class="function">main</code> is run and returns, and
150
then the finalisation code of the shared objects is run,
151
presumably in inverse order to which they were initialised.
152
Remember, this is still all happening on the simulated CPU.
153
Eventually <code class="filename">valgrind.so</code>'s own finalisation
154
code is called. It spots this event, shuts down the simulated
155
CPU, prints any error summaries and/or does leak detection, and
156
returns from the initialisation code on the real CPU. At this
157
point, in effect the real and synthetic CPUs have merged back
158
into one, Valgrind has lost control of the program, and the
159
program finally <code class="function">exit()s</code> back to
160
the kernel in the usual way.</p>
161
<p>The normal course of activity, once Valgrind has started
162
up, is as follows. Valgrind never runs any part of your program
163
(usually referred to as the "client"), not a single byte of it,
164
directly. Instead it uses function
165
<code class="function">VG_(translate)</code> to translate
166
basic blocks (BBs, straight-line sequences of code) into
167
instrumented translations, and those are run instead. The
168
translations are stored in the translation cache (TC),
169
<code class="computeroutput">vg_tc</code>, with the translation
170
table (TT), <code class="computeroutput">vg_tt</code> supplying the
171
original-to-translation code address mapping. Auxiliary array
172
<code class="computeroutput">VG_(tt_fast)</code> is used as a
173
direct-map cache for fast lookups in TT; it usually achieves a
174
hit rate of around 98% and facilitates an orig-to-trans lookup in
175
4 x86 insns, which is not bad.</p>
176
<p>Function <code class="function">VG_(dispatch)</code> in
177
<code class="filename">vg_dispatch.S</code> is the heart of the JIT
178
dispatcher. Once a translated code address has been found, it is
179
executed simply by an x86 <code class="computeroutput">call</code>
180
to the translation. At the end of the translation, the next
181
original code addr is loaded into
182
<code class="computeroutput">%eax</code>, and the translation then
183
does a <code class="computeroutput">ret</code>, taking it back to
184
the dispatch loop, with, interestingly, zero branch
185
mispredictions. The address requested in
186
<code class="computeroutput">%eax</code> is looked up first in
187
<code class="function">VG_(tt_fast)</code>, and, if not found,
189
<code class="function">VG_(search_transtab)</code>. If there
190
is still no translation available,
191
<code class="function">VG_(dispatch)</code> exits back to the
192
top-level C dispatcher
193
<code class="function">VG_(toploop)</code>, which arranges for
194
<code class="function">VG_(translate)</code> to make a new
195
translation. All fairly unsurprising, really. There are various
196
complexities described below.</p>
197
<p>The translator, orchestrated by
198
<code class="function">VG_(translate)</code>, is complicated
199
but entirely self-contained. It is described in great detail in
200
subsequent sections. Translations are stored in TC, with TT
201
tracking administrative information. The translations are
202
subject to an approximate LRU-based management scheme. With the
203
current settings, the TC can hold at most about 15MB of
204
translations, and LRU passes prune it to about 13.5MB. Given
205
that the orig-to-translation expansion ratio is about 13:1 to
206
14:1, this means TC holds translations for more or less a
207
megabyte of original code, which generally comes to about 70000
208
basic blocks for C++ compiled with optimisation on. Generating
209
new translations is expensive, so it is worth having a large TC
210
to minimise the (capacity) miss rate.</p>
212
<code class="function">VG_(dispatch)</code>, receives hints
213
from the translations which allow it to cheaply spot all control
214
transfers corresponding to x86
215
<code class="computeroutput">call</code> and
216
<code class="computeroutput">ret</code> instructions. It has to do
217
this in order to spot some special events:</p>
218
<div class="itemizedlist"><ul type="disc">
220
<code class="function">VG_(shutdown)</code>. This is
221
Valgrind's cue to exit. NOTE: actually this is done a
222
different way; it should be cleaned up.</p></li>
223
<li><p>Returns of system call handlers, to the return address
224
<code class="function">VG_(signalreturn_bogusRA)</code>.
225
The signal simulator needs to know when a signal handler is
226
returning, so we spot jumps (returns) to this address.</p></li>
227
<li><p>Calls to <code class="function">vg_trap_here</code>.
228
All <code class="function">malloc</code>,
229
<code class="function">free</code>, etc calls that the
230
client program makes are eventually routed to a call to
231
<code class="function">vg_trap_here</code>, and Valgrind
232
does its own special thing with these calls. In effect this
233
provides a trapdoor, by which Valgrind can intercept certain
234
calls on the simulated CPU, run the call as it sees fit
235
itself (on the real CPU), and return the result to the
236
simulated CPU, quite transparently to the client
239
<p>Valgrind intercepts the client's
240
<code class="function">malloc</code>,
241
<code class="function">free</code>, etc, calls, so that it can
242
store additional information. Each block
243
<code class="function">malloc</code>'d by the client gives
244
rise to a shadow block in which Valgrind stores the call stack at
245
the time of the <code class="function">malloc</code> call.
246
When the client calls <code class="function">free</code>,
247
Valgrind tries to find the shadow block corresponding to the
248
address passed to <code class="function">free</code>, and
249
emits an error message if none can be found. If it is found, the
250
block is placed on the freed blocks queue
251
<code class="computeroutput">vg_freed_list</code>, it is marked as
252
inaccessible, and its shadow block now records the call stack at
253
the time of the <code class="function">free</code> call.
254
Keeping <code class="computeroutput">free</code>'d blocks in this
255
queue allows Valgrind to spot all (presumably invalid) accesses
256
to them. However, once the volume of blocks in the free queue
257
exceeds <code class="function">VG_(clo_freelist_vol)</code>,
258
blocks are finally removed from the queue.</p>
259
<p>Keeping track of <code class="literal">A</code> and
260
<code class="literal">V</code> bits (note: if you don't know what these
261
are, you haven't read the user guide carefully enough) for memory
262
is done in <code class="filename">vg_memory.c</code>. This implements a
263
sparse array structure which covers the entire 4G address space
264
in a way which is reasonably fast and reasonably space efficient.
265
The 4G address space is divided up into 64K sections, each
266
covering 64Kb of address space. Given a 32-bit address, the top
267
16 bits are used to select one of the 65536 entries in
268
<code class="function">VG_(primary_map)</code>. The resulting
269
"secondary" (<code class="computeroutput">SecMap</code>) holds A and
270
V bits for the 64k of address space chunk corresponding to the
271
lower 16 bits of the address.</p>
273
<div class="sect2" lang="en">
274
<div class="titlepage"><div><div><h3 class="title">
275
<a name="mc-tech-docs.design"></a>1.1.3.�Design decisions</h3></div></div></div>
276
<p>Some design decisions were motivated by the need to make
277
Valgrind debuggable. Imagine you are writing a CPU simulator.
278
It works fairly well. However, you run some large program, like
279
Netscape, and after tens of millions of instructions, it crashes.
280
How can you figure out where in your simulator the bug is?</p>
281
<p>Valgrind's answer is: cheat. Valgrind is designed so that
282
it is possible to switch back to running the client program on
283
the real CPU at any point. Using the
284
<code class="option">--stop-after= </code> flag, you can ask
285
Valgrind to run just some number of basic blocks, and then run
286
the rest of the way on the real CPU. If you are searching for a
287
bug in the simulated CPU, you can use this to do a binary search,
288
which quickly leads you to the specific basic block which is
289
causing the problem.</p>
290
<p>This is all very handy. It does constrain the design in
291
certain unimportant ways. Firstly, the layout of memory, when
292
viewed from the client's point of view, must be identical
293
regardless of whether it is running on the real or simulated CPU.
294
This means that Valgrind can't do pointer swizzling -- well, no
295
great loss -- and it can't run on the same stack as the client --
296
again, no great loss. Valgrind operates on its own stack,
297
<code class="function">VG_(stack)</code>, which it switches to
298
at startup, temporarily switching back to the client's stack when
299
doing system calls for the client.</p>
300
<p>Valgrind also receives signals on its own stack,
301
<code class="computeroutput">VG_(sigstack)</code>, but for different
302
gruesome reasons discussed below.</p>
304
switch-back-to-the-real-CPU-whenever-you-like story is muddied by
305
signals. Problem is that signals arrive at arbitrary times and
306
tend to slightly perturb the basic block count, with the result
307
that you can get close to the basic block causing a problem but
308
can't home in on it exactly. My kludgey hack is to define
309
<code class="computeroutput">SIGNAL_SIMULATION</code> to 1 towards
310
the bottom of <code class="filename">vg_syscall_mem.c</code>, so that
311
signal handlers are run on the real CPU and don't change the BB
313
<p>A second hole in the switch-back-to-real-CPU story is that
314
Valgrind's way of delivering signals to the client is different
315
from that of the kernel. Specifically, the layout of the signal
316
delivery frame, and the mechanism used to detect a sighandler
317
returning, are different. So you can't expect to make the
318
transition inside a sighandler and still have things working, but
319
in practice that's not much of a restriction.</p>
320
<p>Valgrind's implementation of
321
<code class="function">malloc</code>,
322
<code class="function">free</code>, etc, (in
323
<code class="filename">vg_clientmalloc.c</code>, not the low-level stuff
324
in <code class="filename">vg_malloc2.c</code>) is somewhat complicated by
325
the need to handle switching back at arbitrary points. It does
328
<div class="sect2" lang="en">
329
<div class="titlepage"><div><div><h3 class="title">
330
<a name="mc-tech-docs.correctness"></a>1.1.4.�Correctness</h3></div></div></div>
331
<p>There's only one of me, and I have a Real Life (tm) as well
332
as hacking Valgrind [allegedly :-]. That means I don't have time
333
to waste chasing endless bugs in Valgrind. My emphasis is
334
therefore on doing everything as simply as possible, with
335
correctness, stability and robustness being the number one
336
priority, more important than performance or functionality. As a
338
<div class="itemizedlist"><ul type="disc">
340
<p>The code is absolutely loaded with assertions, and
341
these are <span><strong class="command">permanently enabled.</strong></span> I have no
342
plan to remove or disable them later. Over the past couple
343
of months, as valgrind has become more widely used, they have
344
shown their worth, pulling up various bugs which would
345
otherwise have appeared as hard-to-find segmentation
347
<p>I am of the view that it's acceptable to spend 5% of
348
the total running time of your valgrindified program doing
349
assertion checks and other internal sanity checks.</p>
352
<p>Aside from the assertions, valgrind contains various
353
sets of internal sanity checks, which get run at varying
354
frequencies during normal operation.
355
<code class="function">VG_(do_sanity_checks)</code> runs
356
every 1000 basic blocks, which means 500 to 2000 times/second
357
for typical machines at present. It checks that Valgrind
358
hasn't overrun its private stack, and does some simple checks
359
on the memory permissions maps. Once every 25 calls it does
360
some more extensive checks on those maps. Etc, etc.</p>
361
<p>The following components also have sanity check code,
362
which can be enabled to aid debugging:</p>
363
<div class="itemizedlist"><ul type="circle">
364
<li><p>The low-level memory-manager
365
(<code class="computeroutput">VG_(mallocSanityCheckArena)</code>).
366
This does a complete check of all blocks and chains in an
367
arena, which is very slow. Is not engaged by default.</p></li>
368
<li><p>The symbol table reader(s): various checks to
369
ensure uniqueness of mappings; see
370
<code class="function">VG_(read_symbols)</code> for a
371
start. Is permanently engaged.</p></li>
372
<li><p>The A and V bit tracking stuff in
373
<code class="filename">vg_memory.c</code>. This can be compiled
375
<code class="computeroutput">VG_DEBUG_MEMORY</code> defined,
376
which removes all the fast, optimised cases, and uses
377
simple-but-slow fallbacks instead. Not engaged by
380
<code class="computeroutput">VG_DEBUG_LEAKCHECK</code>.</p></li>
381
<li><p>The JITter parses x86 basic blocks into sequences
382
of UCode instructions. It then sanity checks each one
383
with <code class="function">VG_(saneUInstr)</code> and
384
sanity checks the sequence as a whole with
385
<code class="function">VG_(saneUCodeBlock)</code>.
386
This stuff is engaged by default, and has caught some
387
way-obscure bugs in the simulated CPU machinery in its
389
<li><p>The system call wrapper does
390
<code class="function">VG_(first_and_last_secondaries_look_plausible)</code>
391
after every syscall; this is known to pick up bugs in the
392
syscall wrappers. Engaged by default.</p></li>
393
<li><p>The main dispatch loop, in
394
<code class="function">VG_(dispatch)</code>, checks
395
that translations do not set
396
<code class="computeroutput">%ebp</code> to any value
398
<code class="computeroutput">VG_EBP_DISPATCH_CHECKED</code>
399
or <code class="computeroutput">& VG_(baseBlock)</code>.
400
In effect this test is free, and is permanently
402
<li><p>There are a couple of ifdefed-out consistency
403
checks I inserted whilst debugging the new register
405
<code class="computeroutput">vg_do_register_allocation</code>.</p></li>
408
<li><p>I try to avoid techniques, algorithms, mechanisms, etc,
409
for which I can supply neither a convincing argument that
410
they are correct, nor sanity-check code which might pick up
411
bugs in my implementation. I don't always succeed in this,
412
but I try. Basically the idea is: avoid techniques which
413
are, in practice, unverifiable, in some sense. When doing
414
anything, always have in mind: "how can I verify that this is
417
<p>Some more specific things are:</p>
418
<div class="itemizedlist"><ul type="disc">
420
<p>Valgrind runs in the same namespace as the client, at
421
least from <code class="filename">ld.so</code>'s point of view, and it
422
therefore absolutely had better not export any symbol with a
423
name which could clash with that of the client or any of its
424
libraries. Therefore, all globally visible symbols exported
425
from <code class="filename">valgrind.so</code> are defined using the
426
<code class="computeroutput">VG_</code> CPP macro. As you'll
427
see from <code class="filename">vg_constants.h</code>, this appends
428
some arbitrary prefix to the symbol, in order that it be, we
429
hope, globally unique. Currently the prefix is
430
<code class="computeroutput">vgPlain_</code>. For convenience
431
there are also <code class="computeroutput">VGM_</code>,
432
<code class="computeroutput">VGP_</code> and
433
<code class="computeroutput">VGOFF_</code>. All locally defined
434
symbols are declared <code class="computeroutput">static</code>
435
and do not appear in the final shared object.</p>
436
<p>To check this, I periodically do <code class="computeroutput">nm
437
valgrind.so | grep " T "</code>, which shows you
438
all the globally exported text symbols. They should all have
439
an approved prefix, except for those like
440
<code class="function">malloc</code>,
441
<code class="function">free</code>, etc, which we
442
deliberately want to shadow and take precedence over the same
443
names exported from <code class="filename">glibc.so</code>, so that
444
valgrind can intercept those calls easily. Similarly,
445
<code class="computeroutput">nm valgrind.so | grep " D "</code>
446
allows you to find any rogue data-segment symbol
450
<p>Valgrind tries, and almost succeeds, in being
451
completely independent of all other shared objects, in
452
particular of <code class="filename">glibc.so</code>. For example, we
453
have our own low-level memory manager in
454
<code class="filename">vg_malloc2.c</code>, which is a fairly standard
455
malloc/free scheme augmented with arenas, and
456
<code class="filename">vg_mylibc.c</code> exports reimplementations of
457
various bits and pieces you'd normally get from the C
459
<p>Why all the hassle? Because imagine the potential
460
chaos of both the simulated and real CPUs executing in
461
<code class="filename">glibc.so</code>. It just seems simpler and
462
cleaner to be completely self-contained, so that only the
463
simulated CPU visits <code class="filename">glibc.so</code>. In
464
practice it's not much hassle anyway. Also, valgrind starts
465
up before glibc has a chance to initialise itself, and who
466
knows what difficulties that could lead to. Finally, glibc
467
has definitions for some types, specifically
468
<code class="computeroutput">sigset_t</code>, which conflict
469
(are different from) the Linux kernel's idea of same. When
470
Valgrind wants to fiddle around with signal stuff, it wants
471
to use the kernel's definitions, not glibc's definitions. So
472
it's simplest just to keep glibc out of the picture
474
<p>To find out which glibc symbols are used by Valgrind,
475
reinstate the link flags <code class="computeroutput">-nostdlib
476
-Wl,-no-undefined</code>. This causes linking to
477
fail, but will tell you what you depend on. I have mostly,
478
but not entirely, got rid of the glibc dependencies; what
479
remains is, IMO, fairly harmless. AFAIK the current
480
dependencies are: <code class="computeroutput">memset</code>,
481
<code class="computeroutput">memcmp</code>,
482
<code class="computeroutput">stat</code>,
483
<code class="computeroutput">system</code>,
484
<code class="computeroutput">sbrk</code>,
485
<code class="computeroutput">setjmp</code> and
486
<code class="computeroutput">longjmp</code>.</p>
489
<p>Similarly, valgrind should not really import any
490
headers other than the Linux kernel headers, since it knows
491
of no API other than the kernel interface to talk to. At the
492
moment this is really not in a good state, and
493
<code class="computeroutput">vg_syscall_mem</code> imports, via
494
<code class="filename">vg_unsafe.h</code>, a significant number of
495
C-library headers so as to know the sizes of various structs
496
passed across the kernel boundary. This is of course
497
completely bogus, since there is no guarantee that the C
498
library's definitions of these structs matches those of the
499
kernel. I have started to sort this out using
500
<code class="filename">vg_kerneliface.h</code>, into which I had
501
intended to copy all kernel definitions which valgrind could
502
need, but this has not gotten very far. At the moment it
503
mostly contains definitions for
504
<code class="computeroutput">sigset_t</code> and
505
<code class="computeroutput">struct sigaction</code>, since the
506
kernel's definition for these really does clash with glibc's.
507
I plan to use a <code class="computeroutput">vki_</code> prefix
508
on all these types and constants, to denote the fact that
509
they pertain to <span><strong class="command">V</strong></span>algrind's
510
<span><strong class="command">K</strong></span>ernel
511
<span><strong class="command">I</strong></span>nterface.</p>
512
<p>Another advantage of having a
513
<code class="filename">vg_kerneliface.h</code> file is that it makes
514
it simpler to interface to a different kernel. Once can, for
515
example, easily imagine writing a new
516
<code class="filename">vg_kerneliface.h</code> for FreeBSD, or x86
521
<div class="sect2" lang="en">
522
<div class="titlepage"><div><div><h3 class="title">
523
<a name="mc-tech-docs.limits"></a>1.1.5.�Current limitations</h3></div></div></div>
524
<p>Support for weird (non-POSIX) signal stuff is patchy. Does
528
<div class="sect1" lang="en">
529
<div class="titlepage"><div><div><h2 class="title" style="clear: both">
530
<a name="mc-tech-docs.jitter"></a>1.2.�The instrumenting JITter</h2></div></div></div>
531
<p>This really is the heart of the matter. We begin with
532
various side issues.</p>
533
<div class="sect2" lang="en">
534
<div class="titlepage"><div><div><h3 class="title">
535
<a name="mc-tech-docs.storage"></a>1.2.1.�Run-time storage, and the use of host registers</h3></div></div></div>
536
<p>Valgrind translates client (original) basic blocks into
537
instrumented basic blocks, which live in the translation cache
538
TC, until either the client finishes or the translations are
539
ejected from TC to make room for newer ones.</p>
540
<p>Since it generates x86 code in memory, Valgrind has
541
complete control of the use of registers in the translations.
542
Now pay attention. I shall say this only once, and it is
543
important you understand this. In what follows I will refer to
544
registers in the host (real) cpu using their standard names,
545
<code class="computeroutput">%eax</code>,
546
<code class="computeroutput">%edi</code>, etc. I refer to registers
547
in the simulated CPU by capitalising them:
548
<code class="computeroutput">%EAX</code>,
549
<code class="computeroutput">%EDI</code>, etc. These two sets of
550
registers usually bear no direct relationship to each other;
551
there is no fixed mapping between them. This naming scheme is
552
used fairly consistently in the comments in the sources.</p>
553
<p>Host registers, once things are up and running, are used as
555
<div class="itemizedlist"><ul type="disc">
556
<li><p><code class="computeroutput">%esp</code>, the real stack
557
pointer, points somewhere in Valgrind's private stack area,
558
<code class="computeroutput">VG_(stack)</code> or, transiently,
559
into its signal delivery stack,
560
<code class="computeroutput">VG_(sigstack)</code>.</p></li>
561
<li><p><code class="computeroutput">%edi</code> is used as a
562
temporary in code generation; it is almost always dead,
563
except when used for the
564
<code class="computeroutput">Left</code> value-tag operations.</p></li>
565
<li><p><code class="computeroutput">%eax</code>,
566
<code class="computeroutput">%ebx</code>,
567
<code class="computeroutput">%ecx</code>,
568
<code class="computeroutput">%edx</code> and
569
<code class="computeroutput">%esi</code> are available to
570
Valgrind's register allocator. They are dead (carry
571
unimportant values) in between translations, and are live
572
only in translations. The one exception to this is
573
<code class="computeroutput">%eax</code>, which, as mentioned
574
far above, has a special significance to the dispatch loop
575
<code class="computeroutput">VG_(dispatch)</code>: when a
576
translation returns to the dispatch loop,
577
<code class="computeroutput">%eax</code> is expected to contain
578
the original-code-address of the next translation to run.
579
The register allocator is so good at minimising spill code
580
that using five regs and not having to save/restore
581
<code class="computeroutput">%edi</code> actually gives better
582
code than allocating to <code class="computeroutput">%edi</code>
583
as well, but then having to push/pop it around special
585
<li><p><code class="computeroutput">%ebp</code> points
587
<code class="computeroutput">VG_(baseBlock)</code>. Valgrind's
588
translations are position-independent, partly because this is
589
convenient, but also because translations get moved around in
590
TC as part of the LRUing activity. <span><strong class="command">All</strong></span>
591
static entities which need to be referred to from generated
592
code, whether data or helper functions, are stored starting
593
at <code class="computeroutput">VG_(baseBlock)</code> and are
594
therefore reached by indexing from
595
<code class="computeroutput">%ebp</code>. There is but one
596
exception, which is that by placing the value
597
<code class="computeroutput">VG_EBP_DISPATCH_CHECKED</code> in
598
<code class="computeroutput">%ebp</code> just before a return to
599
the dispatcher, the dispatcher is informed that the next
600
address to run, in <code class="computeroutput">%eax</code>,
601
requires special treatment.</p></li>
602
<li><p>The real machine's FPU state is pretty much
603
unimportant, for reasons which will become obvious. Ditto
604
its <code class="computeroutput">%eflags</code> register.</p></li>
606
<p>The state of the simulated CPU is stored in memory, in
607
<code class="computeroutput">VG_(baseBlock)</code>, which is a block
608
of 200 words IIRC. Recall that
609
<code class="computeroutput">%ebp</code> points permanently at the
610
start of this block. Function
611
<code class="computeroutput">vg_init_baseBlock</code> decides what
612
the offsets of various entities in
613
<code class="computeroutput">VG_(baseBlock)</code> are to be, and
614
allocates word offsets for them. The code generator then emits
615
<code class="computeroutput">%ebp</code> relative addresses to get
616
at those things. The sequence in which entities are allocated
617
has been carefully chosen so that the 32 most popular entities
618
come first, because this means 8-bit offsets can be used in the
620
<p>If I was clever, I could make
621
<code class="computeroutput">%ebp</code> point 32 words along
622
<code class="computeroutput">VG_(baseBlock)</code>, so that I'd have
623
another 32 words of short-form offsets available, but that's just
624
complicated, and it's not important -- the first 32 words take
625
99% (or whatever) of the traffic.</p>
626
<p>Currently, the sequence of stuff in
627
<code class="computeroutput">VG_(baseBlock)</code> is as
629
<div class="itemizedlist"><ul type="disc">
630
<li><p>9 words, holding the simulated integer registers,
631
<code class="computeroutput">%EAX</code>
632
.. <code class="computeroutput">%EDI</code>, and the simulated
633
flags, <code class="computeroutput">%EFLAGS</code>.</p></li>
634
<li><p>Another 9 words, holding the V bit "shadows" for the
635
above 9 regs.</p></li>
636
<li><p>The <span><strong class="command">addresses</strong></span> of various helper
637
routines called from generated code:
638
<code class="computeroutput">VG_(helper_value_check4_fail)</code>,
639
<code class="computeroutput">VG_(helper_value_check0_fail)</code>,
640
which register V-check failures,
641
<code class="computeroutput">VG_(helperc_STOREV4)</code>,
642
<code class="computeroutput">VG_(helperc_STOREV1)</code>,
643
<code class="computeroutput">VG_(helperc_LOADV4)</code>,
644
<code class="computeroutput">VG_(helperc_LOADV1)</code>, which
645
do stores and loads of V bits to/from the sparse array which
646
keeps track of V bits in memory, and
647
<code class="computeroutput">VGM_(handle_esp_assignment)</code>,
648
which messes with memory addressibility resulting from
649
changes in <code class="computeroutput">%ESP</code>.</p></li>
650
<li><p>The simulated <code class="computeroutput">%EIP</code>.</p></li>
651
<li><p>24 spill words, for when the register allocator can't
652
make it work with 5 measly registers.</p></li>
653
<li><p>Addresses of helpers
654
<code class="computeroutput">VG_(helperc_STOREV2)</code>,
655
<code class="computeroutput">VG_(helperc_LOADV2)</code>. These
656
are here because 2-byte loads and stores are relatively rare,
657
so are placed above the magic 32-word offset boundary.</p></li>
658
<li><p>For similar reasons, addresses of helper functions
659
<code class="computeroutput">VGM_(fpu_write_check)</code> and
660
<code class="computeroutput">VGM_(fpu_read_check)</code>, which
661
handle the A/V maps testing and changes required by FPU
662
writes/reads.</p></li>
663
<li><p>Some other boring helper addresses:
664
<code class="computeroutput">VG_(helper_value_check2_fail)</code>
666
<code class="computeroutput">VG_(helper_value_check1_fail)</code>.
667
These are probably never emitted now, and should be
669
<li><p>The entire state of the simulated FPU, which I believe
670
to be 108 bytes long.</p></li>
671
<li><p>Finally, the addresses of various other helper
672
functions in <code class="filename">vg_helpers.S</code>, which deal
673
with rare situations which are tedious or difficult to
674
generate code in-line for.</p></li>
676
<p>As a general rule, the simulated machine's state lives
677
permanently in memory at
678
<code class="computeroutput">VG_(baseBlock)</code>. However, the
679
JITter does some optimisations which allow the simulated integer
680
registers to be cached in real registers over multiple simulated
681
instructions within the same basic block. These are always
682
flushed back into memory at the end of every basic block, so that
683
the in-memory state is up-to-date between basic blocks. (This
684
flushing is implied by the statement above that the real
685
machine's allocatable registers are dead in between simulated
688
<div class="sect2" lang="en">
689
<div class="titlepage"><div><div><h3 class="title">
690
<a name="mc-tech-docs.startup"></a>1.2.2.�Startup, shutdown, and system calls</h3></div></div></div>
691
<p>Getting into of Valgrind
692
(<code class="computeroutput">VG_(startup)</code>, called from
693
<code class="filename">valgrind.so</code>'s initialisation section),
694
really means copying the real CPU's state into
695
<code class="computeroutput">VG_(baseBlock)</code>, and then
696
installing our own stack pointer, etc, into the real CPU, and
697
then starting up the JITter. Exiting valgrind involves copying
698
the simulated state back to the real state.</p>
699
<p>Unfortunately, there's a complication at startup time.
700
Problem is that at the point where we need to take a snapshot of
701
the real CPU's state, the offsets in
702
<code class="computeroutput">VG_(baseBlock)</code> are not set up
703
yet, because to do so would involve disrupting the real machine's
704
state significantly. The way round this is to dump the real
705
machine's state into a temporary, static block of memory,
706
<code class="computeroutput">VG_(m_state_static)</code>. We can
707
then set up the <code class="computeroutput">VG_(baseBlock)</code>
708
offsets at our leisure, and copy into it from
709
<code class="computeroutput">VG_(m_state_static)</code> at some
710
convenient later time. This copying is done by
711
<code class="computeroutput">VG_(copy_m_state_static_to_baseBlock)</code>.</p>
712
<p>On exit, the inverse transformation is (rather
713
unnecessarily) used: stuff in
714
<code class="computeroutput">VG_(baseBlock)</code> is copied to
715
<code class="computeroutput">VG_(m_state_static)</code>, and the
716
assembly stub then copies from
717
<code class="computeroutput">VG_(m_state_static)</code> into the
718
real machine registers.</p>
719
<p>Doing system calls on behalf of the client
720
(<code class="filename">vg_syscall.S</code>) is something of a half-way
721
house. We have to make the world look sufficiently like that
722
which the client would normally have to make the syscall actually
723
work properly, but we can't afford to lose control. So the trick
724
is to copy all of the client's state, <span><strong class="command">except its program
725
counter</strong></span>, into the real CPU, do the system call, and
726
copy the state back out. Note that the client's state includes
727
its stack pointer register, so one effect of this partial
728
restoration is to cause the system call to be run on the client's
729
stack, as it should be.</p>
730
<p>As ever there are complications. We have to save some of
731
our own state somewhere when restoring the client's state into
732
the CPU, so that we can keep going sensibly afterwards. In fact
733
the only thing which is important is our own stack pointer, but
734
for paranoia reasons I save and restore our own FPU state as
735
well, even though that's probably pointless.</p>
736
<p>The complication on the above complication is, that for
737
horrible reasons to do with signals, we may have to handle a
738
second client system call whilst the client is blocked inside
739
some other system call (unbelievable!). That means there's two
740
sets of places to dump Valgrind's stack pointer and FPU state
741
across the syscall, and we decide which to use by consulting
742
<code class="computeroutput">VG_(syscall_depth)</code>, which is in
744
<code class="computeroutput">VG_(wrap_syscall)</code>.</p>
746
<div class="sect2" lang="en">
747
<div class="titlepage"><div><div><h3 class="title">
748
<a name="mc-tech-docs.ucode"></a>1.2.3.�Introduction to UCode</h3></div></div></div>
749
<p>UCode lies at the heart of the x86-to-x86 JITter. The
750
basic premise is that dealing the the x86 instruction set head-on
751
is just too darn complicated, so we do the traditional
752
compiler-writer's trick and translate it into a simpler,
753
easier-to-deal-with form.</p>
754
<p>In normal operation, translation proceeds through six
755
stages, coordinated by
756
<code class="computeroutput">VG_(translate)</code>:</p>
757
<div class="orderedlist"><ol type="1">
758
<li><p>Parsing of an x86 basic block into a sequence of UCode
759
instructions (<code class="computeroutput">VG_(disBB)</code>).</p></li>
760
<li><p>UCode optimisation
761
(<code class="computeroutput">vg_improve</code>), with the aim
762
of caching simulated registers in real registers over
763
multiple simulated instructions, and removing redundant
764
simulated <code class="computeroutput">%EFLAGS</code>
765
saving/restoring.</p></li>
766
<li><p>UCode instrumentation
767
(<code class="computeroutput">vg_instrument</code>), which adds
768
value and address checking code.</p></li>
769
<li><p>Post-instrumentation cleanup
770
(<code class="computeroutput">vg_cleanup</code>), removing
771
redundant value-check computations.</p></li>
772
<li><p>Register allocation
773
(<code class="computeroutput">vg_do_register_allocation</code>),
774
which, note, is done on UCode.</p></li>
775
<li><p>Emission of final instrumented x86 code
776
(<code class="computeroutput">VG_(emit_code)</code>).</p></li>
778
<p>Notice how steps 2, 3, 4 and 5 are simple UCode-to-UCode
779
transformation passes, all on straight-line blocks of UCode (type
780
<code class="computeroutput">UCodeBlock</code>). Steps 2 and 4 are
781
optimisation passes and can be disabled for debugging purposes,
782
with <code class="option">--optimise=no</code> and
783
<code class="option">--cleanup=no</code> respectively.</p>
784
<p>Valgrind can also run in a no-instrumentation mode, given
785
<code class="option">--instrument=no</code>. This is useful
786
for debugging the JITter quickly without having to deal with the
787
complexity of the instrumentation mechanism too. In this mode,
788
steps 3 and 4 are omitted.</p>
789
<p>These flags combine, so that
790
<code class="option">--instrument=no</code> together with
791
<code class="option">--optimise=no</code> means only steps
793
<code class="option">--single-step=yes</code> causes each
794
x86 instruction to be treated as a single basic block. The
795
translations are terrible but this is sometimes instructive.</p>
796
<p>The <code class="option">--stop-after=N</code> flag
797
switches back to the real CPU after
798
<code class="computeroutput">N</code> basic blocks. It also re-JITs
799
the final basic block executed and prints the debugging info
800
resulting, so this gives you a way to get a quick snapshot of how
801
a basic block looks as it passes through the six stages mentioned
802
above. If you want to see full information for every block
803
translated (probably not, but still ...) find, in
804
<code class="computeroutput">VG_(translate)</code>, the lines</p>
805
<pre class="programlisting">
807
dis = debugging_translation;</pre>
808
<p>and comment out the second line. This will spew out
809
debugging junk faster than you can possibly imagine.</p>
811
<div class="sect2" lang="en">
812
<div class="titlepage"><div><div><h3 class="title">
813
<a name="mc-tech-docs.tags"></a>1.2.4.�UCode operand tags: type <code class="computeroutput">Tag</code></h3></div></div></div>
814
<p>UCode is, more or less, a simple two-address RISC-like
815
code. In keeping with the x86 AT&T assembly syntax,
816
generally speaking the first operand is the source operand, and
817
the second is the destination operand, which is modified when the
818
uinstr is notionally executed.</p>
819
<p>UCode instructions have up to three operand fields, each of
820
which has a corresponding <code class="computeroutput">Tag</code>
821
describing it. Possible values for the tag are:</p>
822
<div class="itemizedlist"><ul type="disc">
823
<li><p><code class="computeroutput">NoValue</code>: indicates
824
that the field is not in use.</p></li>
825
<li><p><code class="computeroutput">Lit16</code>: the field
826
contains a 16-bit literal.</p></li>
827
<li><p><code class="computeroutput">Literal</code>: the field
828
denotes a 32-bit literal, whose value is stored in the
829
<code class="computeroutput">lit32</code> field of the uinstr
830
itself. Since there is only one
831
<code class="computeroutput">lit32</code> for the whole uinstr,
832
only one operand field may contain this tag.</p></li>
833
<li><p><code class="computeroutput">SpillNo</code>: the field
834
contains a spill slot number, in the range 0 to 23 inclusive,
835
denoting one of the spill slots contained inside
836
<code class="computeroutput">VG_(baseBlock)</code>. Such tags
837
only exist after register allocation.</p></li>
838
<li><p><code class="computeroutput">RealReg</code>: the field
839
contains a number in the range 0 to 7 denoting an integer x86
840
("real") register on the host. The number is the Intel
841
encoding for integer registers. Such tags only exist after
842
register allocation.</p></li>
843
<li><p><code class="computeroutput">ArchReg</code>: the field
844
contains a number in the range 0 to 7 denoting an integer x86
845
register on the simulated CPU. In reality this means a
846
reference to one of the first 8 words of
847
<code class="computeroutput">VG_(baseBlock)</code>. Such tags
848
can exist at any point in the translation process.</p></li>
850
<p>Last, but not least,
851
<code class="computeroutput">TempReg</code>. The field contains
852
the number of one of an infinite set of virtual (integer)
853
registers. <code class="computeroutput">TempReg</code>s are used
854
everywhere throughout the translation process; you can have
855
as many as you want. The register allocator maps as many as
856
it can into <code class="computeroutput">RealReg</code>s and
858
<code class="computeroutput">SpillNo</code>s, so
859
<code class="computeroutput">TempReg</code>s should not exist
860
after the register allocation phase.</p>
861
<p><code class="computeroutput">TempReg</code>s are always 32
862
bits long, even if the data they hold is logically shorter.
863
In that case the upper unused bits are required, and, I
864
think, generally assumed, to be zero.
865
<code class="computeroutput">TempReg</code>s holding V bits for
866
quantities shorter than 32 bits are expected to have ones in
867
the unused places, since a one denotes "undefined".</p>
871
<div class="sect2" lang="en">
872
<div class="titlepage"><div><div><h3 class="title">
873
<a name="mc-tech-docs.uinstr"></a>1.2.5.�UCode instructions: type <code class="computeroutput">UInstr</code></h3></div></div></div>
874
<p>UCode was carefully designed to make it possible to do
875
register allocation on UCode and then translate the result into
876
x86 code without needing any extra registers ... well, that was
877
the original plan, anyway. Things have gotten a little more
878
complicated since then. In what follows, UCode instructions are
879
referred to as uinstrs, to distinguish them from x86
880
instructions. Uinstrs of course have uopcodes which are
881
(naturally) different from x86 opcodes.</p>
882
<p>A uinstr (type <code class="computeroutput">UInstr</code>)
883
contains various fields, not all of which are used by any one
885
<div class="itemizedlist"><ul type="disc">
886
<li><p>Three 16-bit operand fields,
887
<code class="computeroutput">val1</code>,
888
<code class="computeroutput">val2</code> and
889
<code class="computeroutput">val3</code>.</p></li>
890
<li><p>Three tag fields,
891
<code class="computeroutput">tag1</code>,
892
<code class="computeroutput">tag2</code> and
893
<code class="computeroutput">tag3</code>. Each of these has a
894
value of type <code class="computeroutput">Tag</code>, and they
895
describe what the <code class="computeroutput">val1</code>,
896
<code class="computeroutput">val2</code> and
897
<code class="computeroutput">val3</code> fields contain.</p></li>
898
<li><p>A 32-bit literal field.</p></li>
899
<li><p>Two <code class="computeroutput">FlagSet</code>s,
900
specifying which x86 condition codes are read and written by
902
<li><p>An opcode byte, containing a value of type
903
<code class="computeroutput">Opcode</code>.</p></li>
904
<li><p>A size field, indicating the data transfer size
905
(1/2/4/8/10) in cases where this makes sense, or zero
907
<li><p>A condition-code field, which, for jumps, holds a value
908
of type <code class="computeroutput">Condcode</code>, indicating
909
the condition which applies. The encoding is as it is in the
910
x86 insn stream, except we add a 17th value
911
<code class="computeroutput">CondAlways</code> to indicate an
912
unconditional transfer.</p></li>
913
<li><p>Various 1-bit flags, indicating whether this insn
914
pertains to an x86 CALL or RET instruction, whether a
915
widening is signed or not, etc.</p></li>
917
<p>UOpcodes (type <code class="computeroutput">Opcode</code>) are
918
divided into two groups: those necessary merely to express the
919
functionality of the x86 code, and extra uopcodes needed to
920
express the instrumentation. The former group contains:</p>
921
<div class="itemizedlist"><ul type="disc">
922
<li><p><code class="computeroutput">GET</code> and
923
<code class="computeroutput">PUT</code>, which move values from
924
the simulated CPU's integer registers
925
(<code class="computeroutput">ArchReg</code>s) into
926
<code class="computeroutput">TempReg</code>s, and back.
927
<code class="computeroutput">GETF</code> and
928
<code class="computeroutput">PUTF</code> do the corresponding
929
thing for the simulated
930
<code class="computeroutput">%EFLAGS</code>. There are no
931
corresponding insns for the FPU register stack, since we
932
don't explicitly simulate its registers.</p></li>
933
<li><p><code class="computeroutput">LOAD</code> and
934
<code class="computeroutput">STORE</code>, which, in RISC-like
935
fashion, are the only uinstrs able to interact with
937
<li><p><code class="computeroutput">MOV</code> and
938
<code class="computeroutput">CMOV</code> allow unconditional and
939
conditional moves of values between
940
<code class="computeroutput">TempReg</code>s.</p></li>
942
<p>ALU operations. Again in RISC-like fashion, these only
943
operate on <code class="computeroutput">TempReg</code>s (before
944
reg-alloc) or <code class="computeroutput">RealReg</code>s
945
(after reg-alloc). These are:
946
<code class="computeroutput">ADD</code>,
947
<code class="computeroutput">ADC</code>,
948
<code class="computeroutput">AND</code>,
949
<code class="computeroutput">OR</code>,
950
<code class="computeroutput">XOR</code>,
951
<code class="computeroutput">SUB</code>,
952
<code class="computeroutput">SBB</code>,
953
<code class="computeroutput">SHL</code>,
954
<code class="computeroutput">SHR</code>,
955
<code class="computeroutput">SAR</code>,
956
<code class="computeroutput">ROL</code>,
957
<code class="computeroutput">ROR</code>,
958
<code class="computeroutput">RCL</code>,
959
<code class="computeroutput">RCR</code>,
960
<code class="computeroutput">NOT</code>,
961
<code class="computeroutput">NEG</code>,
962
<code class="computeroutput">INC</code>,
963
<code class="computeroutput">DEC</code>,
964
<code class="computeroutput">BSWAP</code>,
965
<code class="computeroutput">CC2VAL</code> and
966
<code class="computeroutput">WIDEN</code>.
967
<code class="computeroutput">WIDEN</code> does signed or
968
unsigned value widening.
969
<code class="computeroutput">CC2VAL</code> is used to convert
970
condition codes into a value, zero or one. The rest are
972
<p>To allow for more efficient code generation, we bend
973
slightly the restriction at the start of the previous para:
974
for <code class="computeroutput">ADD</code>,
975
<code class="computeroutput">ADC</code>,
976
<code class="computeroutput">XOR</code>,
977
<code class="computeroutput">SUB</code> and
978
<code class="computeroutput">SBB</code>, we allow the first
979
(source) operand to also be an
980
<code class="computeroutput">ArchReg</code>, that is, one of the
981
simulated machine's registers. Also, many of these ALU ops
982
allow the source operand to be a literal. See
983
<code class="computeroutput">VG_(saneUInstr)</code> for the
984
final word on the allowable forms of uinstrs.</p>
986
<li><p><code class="computeroutput">LEA1</code> and
987
<code class="computeroutput">LEA2</code> are not strictly
988
necessary, but allow faciliate better translations. They
989
record the fancy x86 addressing modes in a direct way, which
990
allows those amodes to be emitted back into the final
991
instruction stream more or less verbatim.</p></li>
993
<p><code class="computeroutput">CALLM</code> calls a
994
machine-code helper, one of the methods whose address is
996
<code class="computeroutput">VG_(baseBlock)</code> offset.
997
<code class="computeroutput">PUSH</code> and
998
<code class="computeroutput">POP</code> move values to/from
999
<code class="computeroutput">TempReg</code> to the real
1000
(Valgrind's) stack, and
1001
<code class="computeroutput">CLEAR</code> removes values from
1002
the stack. <code class="computeroutput">CALLM_S</code> and
1003
<code class="computeroutput">CALLM_E</code> delimit the
1004
boundaries of call setups and clearings, for the benefit of
1005
the instrumentation passes. Getting this right is critical,
1006
and so <code class="computeroutput">VG_(saneUCodeBlock)</code>
1007
makes various checks on the use of these uopcodes.</p>
1008
<p>It is important to understand that these uopcodes have
1009
nothing to do with the x86
1010
<code class="computeroutput">call</code>,
1011
<code class="computeroutput">return,</code>
1012
<code class="computeroutput">push</code> or
1013
<code class="computeroutput">pop</code> instructions, and are
1014
not used to implement them. Those guys turn into
1015
combinations of <code class="computeroutput">GET</code>,
1016
<code class="computeroutput">PUT</code>,
1017
<code class="computeroutput">LOAD</code>,
1018
<code class="computeroutput">STORE</code>,
1019
<code class="computeroutput">ADD</code>,
1020
<code class="computeroutput">SUB</code>, and
1021
<code class="computeroutput">JMP</code>. What these uopcodes
1022
support is calling of helper functions such as
1023
<code class="computeroutput">VG_(helper_imul_32_64)</code>,
1024
which do stuff which is too difficult or tedious to emit
1027
<li><p><code class="computeroutput">FPU</code>,
1028
<code class="computeroutput">FPU_R</code> and
1029
<code class="computeroutput">FPU_W</code>. Valgrind doesn't
1030
attempt to simulate the internal state of the FPU at all.
1031
Consequently it only needs to be able to distinguish FPU ops
1032
which read and write memory from those that don't, and for
1033
those which do, it needs to know the effective address and
1034
data transfer size. This is made easier because the x86 FP
1035
instruction encoding is very regular, basically consisting of
1036
16 bits for a non-memory FPU insn and 11 (IIRC) bits + an
1037
address mode for a memory FPU insn. So our
1038
<code class="computeroutput">FPU</code> uinstr carries the 16
1039
bits in its <code class="computeroutput">val1</code> field. And
1040
<code class="computeroutput">FPU_R</code> and
1041
<code class="computeroutput">FPU_W</code> carry 11 bits in that
1042
field, together with the identity of a
1043
<code class="computeroutput">TempReg</code> or (later)
1044
<code class="computeroutput">RealReg</code> which contains the
1046
<li><p><code class="computeroutput">JIFZ</code> is unique, in
1047
that it allows a control-flow transfer which is not deemed to
1048
end a basic block. It causes a jump to a literal (original)
1049
address if the specified argument is zero.</p></li>
1050
<li><p>Finally, <code class="computeroutput">INCEIP</code>
1051
advances the simulated <code class="computeroutput">%EIP</code>
1052
by the specified literal amount. This supports lazy
1053
<code class="computeroutput">%EIP</code> updating, as described
1056
<p>Stages 1 and 2 of the 6-stage translation process mentioned
1057
above deal purely with these uopcodes, and no others. They are
1058
sufficient to express pretty much all the x86 32-bit
1059
protected-mode instruction set, at least everything understood by
1060
a pre-MMX original Pentium (P54C).</p>
1061
<p>Stages 3, 4, 5 and 6 also deal with the following extra
1062
"instrumentation" uopcodes. They are used to express all the
1063
definedness-tracking and -checking machinery which valgrind does.
1064
In later sections we show how to create checking code for each of
1065
the uopcodes above. Note that these instrumentation uopcodes,
1066
although some appearing complicated, have been carefully chosen
1067
so that efficient x86 code can be generated for them. GNU
1068
superopt v2.5 did a great job helping out here. Anyways, the
1069
uopcodes are as follows:</p>
1070
<div class="itemizedlist"><ul type="disc">
1071
<li><p><code class="computeroutput">GETV</code> and
1072
<code class="computeroutput">PUTV</code> are analogues to
1073
<code class="computeroutput">GET</code> and
1074
<code class="computeroutput">PUT</code> above. They are
1075
identical except that they move the V bits for the specified
1076
values back and forth to
1077
<code class="computeroutput">TempRegs</code>, rather than moving
1078
the values themselves.</p></li>
1079
<li><p>Similarly, <code class="computeroutput">LOADV</code> and
1080
<code class="computeroutput">STOREV</code> read and write V bits
1081
from the synthesised shadow memory that Valgrind maintains.
1082
In fact they do more than that, since they also do
1083
address-validity checks, and emit complaints if the
1084
read/written addresses are unaddressible.</p></li>
1085
<li><p><code class="computeroutput">TESTV</code>, whose
1086
parameters are a <code class="computeroutput">TempReg</code> and
1087
a size, tests the V bits in the
1088
<code class="computeroutput">TempReg</code>, at the specified
1089
operation size (0/1/2/4 byte) and emits an error if any of
1090
them indicate undefinedness. This is the only uopcode
1091
capable of doing such tests.</p></li>
1092
<li><p><code class="computeroutput">SETV</code>, whose parameters
1093
are also <code class="computeroutput">TempReg</code> and a size,
1094
makes the V bits in the
1095
<code class="computeroutput">TempReg</code> indicated
1096
definedness, at the specified operation size. This is
1097
usually used to generate the correct V bits for a literal
1098
value, which is of course fully defined.</p></li>
1099
<li><p><code class="computeroutput">GETVF</code> and
1100
<code class="computeroutput">PUTVF</code> are analogues to
1101
<code class="computeroutput">GETF</code> and
1102
<code class="computeroutput">PUTF</code>. They move the single
1103
V bit used to model definedness of
1104
<code class="computeroutput">%EFLAGS</code> between its home in
1105
<code class="computeroutput">VG_(baseBlock)</code> and the
1106
specified <code class="computeroutput">TempReg</code>.</p></li>
1107
<li><p><code class="computeroutput">TAG1</code> denotes one of a
1108
family of unary operations on
1109
<code class="computeroutput">TempReg</code>s containing V bits.
1110
Similarly, <code class="computeroutput">TAG2</code> denotes one
1111
in a family of binary operations on V bits.</p></li>
1113
<p>These 10 uopcodes are sufficient to express Valgrind's
1114
entire definedness-checking semantics. In fact most of the
1115
interesting magic is done by the
1116
<code class="computeroutput">TAG1</code> and
1117
<code class="computeroutput">TAG2</code> suboperations.</p>
1118
<p>First, however, I need to explain about V-vector operation
1119
sizes. There are 4 sizes: 1, 2 and 4, which operate on groups of
1120
8, 16 and 32 V bits at a time, supporting the usual 1, 2 and 4
1121
byte x86 operations. However there is also the mysterious size
1122
0, which really means a single V bit. Single V bits are used in
1123
various circumstances; in particular, the definedness of
1124
<code class="computeroutput">%EFLAGS</code> is modelled with a
1125
single V bit. Now might be a good time to also point out that
1126
for V bits, 1 means "undefined" and 0 means "defined".
1127
Similarly, for A bits, 1 means "invalid address" and 0 means
1128
"valid address". This seems counterintuitive (and so it is), but
1129
testing against zero on x86s saves instructions compared to
1130
testing against all 1s, because many ALU operations set the Z
1131
flag for free, so to speak.</p>
1132
<p>With that in mind, the tag ops are:</p>
1133
<div class="itemizedlist"><ul type="disc">
1135
<p><b>(UNARY) Pessimising casts:�</b><code class="computeroutput">VgT_PCast40</code>,
1136
<code class="computeroutput">VgT_PCast20</code>,
1137
<code class="computeroutput">VgT_PCast10</code>,
1138
<code class="computeroutput">VgT_PCast01</code>,
1139
<code class="computeroutput">VgT_PCast02</code> and
1140
<code class="computeroutput">VgT_PCast04</code>. A "pessimising
1141
cast" takes a V-bit vector at one size, and creates a new one
1142
at another size, pessimised in the sense that if any of the
1143
bits in the source vector indicate undefinedness, then all
1144
the bits in the result indicate undefinedness. In this case
1145
the casts are all to or from a single V bit, so for example
1146
<code class="computeroutput">VgT_PCast40</code> is a pessimising
1147
cast from 32 bits to 1, whereas
1148
<code class="computeroutput">VgT_PCast04</code> simply copies
1149
the single source V bit into all 32 bit positions in the
1150
result. Surprisingly, these ops can all be implemented very
1152
<p>There are also the pessimising casts
1153
<code class="computeroutput">VgT_PCast14</code>, from 8 bits to
1154
32, <code class="computeroutput">VgT_PCast12</code>, from 8 bits
1155
to 16, and <code class="computeroutput">VgT_PCast11</code>, from
1156
8 bits to 8. This last one seems nonsensical, but in fact it
1157
isn't a no-op because, as mentioned above, any undefined (1)
1158
bits in the source infect the entire result.</p>
1160
<li><p><b>(UNARY) Propagating undefinedness upwards in a
1161
word:�</b><code class="computeroutput">VgT_Left4</code>,
1162
<code class="computeroutput">VgT_Left2</code> and
1163
<code class="computeroutput">VgT_Left1</code>. These are used
1164
to simulate the worst-case effects of carry propagation in
1165
adds and subtracts. They return a V vector identical to the
1166
original, except that if the original contained any undefined
1167
bits, then it and all bits above it are marked as undefined
1168
too. Hence the Left bit in the names.</p></li>
1169
<li><p><b>(UNARY) Signed and unsigned value widening:�</b><code class="computeroutput">VgT_SWiden14</code>,
1170
<code class="computeroutput">VgT_SWiden24</code>,
1171
<code class="computeroutput">VgT_SWiden12</code>,
1172
<code class="computeroutput">VgT_ZWiden14</code>,
1173
<code class="computeroutput">VgT_ZWiden24</code> and
1174
<code class="computeroutput">VgT_ZWiden12</code>. These mimic
1175
the definedness effects of standard signed and unsigned
1176
integer widening. Unsigned widening creates zero bits in the
1178
<code class="computeroutput">VgT_ZWiden*</code> accordingly park
1179
mark those parts of their argument as defined. Signed
1180
widening copies the sign bit into the new positions, so
1181
<code class="computeroutput">VgT_SWiden*</code> copies the
1182
definedness of the sign bit into the new positions. Because
1183
1 means undefined and 0 means defined, these operations can
1184
(fascinatingly) be done by the same operations which they
1185
mimic. Go figure.</p></li>
1186
<li><p><b>(BINARY) Undefined-if-either-Undefined,
1187
Defined-if-either-Defined:�</b><code class="computeroutput">VgT_UifU4</code>,
1188
<code class="computeroutput">VgT_UifU2</code>,
1189
<code class="computeroutput">VgT_UifU1</code>,
1190
<code class="computeroutput">VgT_UifU0</code>,
1191
<code class="computeroutput">VgT_DifD4</code>,
1192
<code class="computeroutput">VgT_DifD2</code>,
1193
<code class="computeroutput">VgT_DifD1</code>. These do simple
1194
bitwise operations on pairs of V-bit vectors, with
1195
<code class="computeroutput">UifU</code> giving undefined if
1196
either arg bit is undefined, and
1197
<code class="computeroutput">DifD</code> giving defined if
1198
either arg bit is defined. Abstract interpretation junkies,
1199
if any make it this far, may like to think of them as meets
1200
and joins (or is it joins and meets) in the definedness
1203
<p><b>(BINARY; one value, one V bits) Generate argument
1204
improvement terms for AND and OR.�</b><code class="computeroutput">VgT_ImproveAND4_TQ</code>,
1205
<code class="computeroutput">VgT_ImproveAND2_TQ</code>,
1206
<code class="computeroutput">VgT_ImproveAND1_TQ</code>,
1207
<code class="computeroutput">VgT_ImproveOR4_TQ</code>,
1208
<code class="computeroutput">VgT_ImproveOR2_TQ</code>,
1209
<code class="computeroutput">VgT_ImproveOR1_TQ</code>. These
1210
help out with AND and OR operations. AND and OR have the
1211
inconvenient property that the definedness of the result
1212
depends on the actual values of the arguments as well as
1213
their definedness. At the bit level:</p>
1214
<pre class="programlisting">
1215
1 AND undefined = undefined, but
1216
0 AND undefined = 0, and
1218
0 OR undefined = undefined, but
1219
1 OR undefined = 1.</pre>
1220
<p>It turns out that gcc (quite legitimately) generates
1221
code which relies on this fact, so we have to model it
1222
properly in order to avoid flooding users with spurious value
1223
errors. The ultimate definedness result of AND and OR is
1224
calculated using <code class="computeroutput">UifU</code> on the
1225
definedness of the arguments, but we also
1226
<code class="computeroutput">DifD</code> in some "improvement"
1227
terms which take into account the above phenomena.</p>
1228
<p><code class="computeroutput">ImproveAND</code> takes as
1229
its first argument the actual value of an argument to AND
1230
(the T) and the definedness of that argument (the Q), and
1231
returns a V-bit vector which is defined (0) for bits which
1232
have value 0 and are defined; this, when
1233
<code class="computeroutput">DifD</code> into the final result
1234
causes those bits to be defined even if the corresponding bit
1235
in the other argument is undefined.</p>
1236
<p>The <code class="computeroutput">ImproveOR</code> ops do
1237
the dual thing for OR arguments. Note that XOR does not have
1238
this property that one argument can make the other
1239
irrelevant, so there is no need for such complexity for
1243
<p>That's all the tag ops. If you stare at this long enough,
1244
and then run Valgrind and stare at the pre- and post-instrumented
1245
ucode, it should be fairly obvious how the instrumentation
1246
machinery hangs together.</p>
1247
<p>One point, if you do this: in order to make it easy to
1248
differentiate <code class="computeroutput">TempReg</code>s carrying
1249
values from <code class="computeroutput">TempReg</code>s carrying V
1250
bit vectors, Valgrind prints the former as (for example)
1251
<code class="computeroutput">t28</code> and the latter as
1252
<code class="computeroutput">q28</code>; the fact that they carry
1253
the same number serves to indicate their relationship. This is
1254
purely for the convenience of the human reader; the register
1255
allocator and code generator don't regard them as
1258
<div class="sect2" lang="en">
1259
<div class="titlepage"><div><div><h3 class="title">
1260
<a name="mc-tech-docs.trans"></a>1.2.6.�Translation into UCode</h3></div></div></div>
1261
<p><code class="computeroutput">VG_(disBB)</code> allocates a new
1262
<code class="computeroutput">UCodeBlock</code> and then uses
1263
<code class="computeroutput">disInstr</code> to translate x86
1264
instructions one at a time into UCode, dumping the result in the
1265
<code class="computeroutput">UCodeBlock</code>. This goes on until
1266
a control-flow transfer instruction is encountered.</p>
1267
<p>Despite the large size of
1268
<code class="filename">vg_to_ucode.c</code>, this translation is really
1269
very simple. Each x86 instruction is translated entirely
1270
independently of its neighbours, merrily allocating new
1271
<code class="computeroutput">TempReg</code>s as it goes. The idea
1272
is to have a simple translator -- in reality, no more than a
1273
macro-expander -- and the -- resulting bad UCode translation is
1274
cleaned up by the UCode optimisation phase which follows. To
1275
give you an idea of some x86 instructions and their translations
1276
(this is a complete basic block, as Valgrind sees it):</p>
1277
<pre class="programlisting">
1278
0x40435A50: incl %edx
1280
1: INCL t0 (-wOSZAP)
1283
0x40435A51: movsbl (%edx),%eax
1289
0x40435A54: testb $0x20, 1(%ecx,%eax,2)
1292
9: LEA2L 1(t8,t6,2), t4
1295
12: ANDB t12, t10 (-wOSZACP)
1298
0x40435A59: jnz-8 0x40435A50
1299
14: Jnzo $0x40435A50 (-rOSZACP)
1300
15: JMPo $0x40435A5B</pre>
1301
<p>Notice how the block always ends with an unconditional jump
1302
to the next block. This is a bit unnecessary, but makes many
1304
<p>Most x86 instructions turn into sequences of
1305
<code class="computeroutput">GET</code>,
1306
<code class="computeroutput">PUT</code>,
1307
<code class="computeroutput">LEA1</code>,
1308
<code class="computeroutput">LEA2</code>,
1309
<code class="computeroutput">LOAD</code> and
1310
<code class="computeroutput">STORE</code>. Some complicated ones
1311
however rely on calling helper bits of code in
1312
<code class="filename">vg_helpers.S</code>. The ucode instructions
1313
<code class="computeroutput">PUSH</code>,
1314
<code class="computeroutput">POP</code>,
1315
<code class="computeroutput">CALL</code>,
1316
<code class="computeroutput">CALLM_S</code> and
1317
<code class="computeroutput">CALLM_E</code> support this. The
1318
calling convention is somewhat ad-hoc and is not the C calling
1319
convention. The helper routines must save all integer registers,
1320
and the flags, that they use. Args are passed on the stack
1321
underneath the return address, as usual, and if result(s) are to
1322
be returned, it (they) are either placed in dummy arg slots
1323
created by the ucode <code class="computeroutput">PUSH</code>
1324
sequence, or just overwrite the incoming args.</p>
1325
<p>In order that the instrumentation mechanism can handle
1326
calls to these helpers,
1327
<code class="computeroutput">VG_(saneUCodeBlock)</code> enforces the
1328
following restrictions on calls to helpers:</p>
1329
<div class="itemizedlist"><ul type="disc">
1330
<li><p>Each <code class="computeroutput">CALL</code> uinstr must
1331
be bracketed by a preceding
1332
<code class="computeroutput">CALLM_S</code> marker (dummy
1333
uinstr) and a trailing
1334
<code class="computeroutput">CALLM_E</code> marker. These
1335
markers are used by the instrumentation mechanism later to
1336
establish the boundaries of the
1337
<code class="computeroutput">PUSH</code>,
1338
<code class="computeroutput">POP</code> and
1339
<code class="computeroutput">CLEAR</code> sequences for the
1341
<li><p><code class="computeroutput">PUSH</code>,
1342
<code class="computeroutput">POP</code> and
1343
<code class="computeroutput">CLEAR</code> may only appear inside
1344
sections bracketed by
1345
<code class="computeroutput">CALLM_S</code> and
1346
<code class="computeroutput">CALLM_E</code>, and nowhere else.</p></li>
1347
<li><p>In any such bracketed section, no two
1348
<code class="computeroutput">PUSH</code> insns may push the same
1349
<code class="computeroutput">TempReg</code>. Dually, no two two
1350
<code class="computeroutput">POP</code>s may pop the same
1351
<code class="computeroutput">TempReg</code>.</p></li>
1352
<li><p>Finally, although this is not checked, args should be
1353
removed from the stack with
1354
<code class="computeroutput">CLEAR</code>, rather than
1355
<code class="computeroutput">POP</code>s into a
1356
<code class="computeroutput">TempReg</code> which is not
1357
subsequently used. This is because the instrumentation
1358
mechanism assumes that all values
1359
<code class="computeroutput">POP</code>ped from the stack are
1360
actually used.</p></li>
1362
<p>Some of the translations may appear to have redundant
1363
<code class="computeroutput">TempReg</code>-to-<code class="computeroutput">TempReg</code>
1364
moves. This helps the next phase, UCode optimisation, to
1365
generate better code.</p>
1367
<div class="sect2" lang="en">
1368
<div class="titlepage"><div><div><h3 class="title">
1369
<a name="mc-tech-docs.optim"></a>1.2.7.�UCode optimisation</h3></div></div></div>
1370
<p>UCode is then subjected to an improvement pass
1371
(<code class="computeroutput">vg_improve()</code>), which blurs the
1372
boundaries between the translations of the original x86
1373
instructions. It's pretty straightforward. Three
1374
transformations are done:</p>
1375
<div class="itemizedlist"><ul type="disc">
1377
<p>Redundant <code class="computeroutput">GET</code>
1378
elimination. Actually, more general than that -- eliminates
1379
redundant fetches of ArchRegs. In our running example,
1380
uinstr 3 <code class="computeroutput">GET</code>s
1381
<code class="computeroutput">%EDX</code> into
1382
<code class="computeroutput">t2</code> despite the fact that, by
1383
looking at the previous uinstr, it is already in
1384
<code class="computeroutput">t0</code>. The
1385
<code class="computeroutput">GET</code> is therefore removed,
1386
and <code class="computeroutput">t2</code> renamed to
1387
<code class="computeroutput">t0</code>. Assuming
1388
<code class="computeroutput">t0</code> is allocated to a host
1389
register, it means the simulated
1390
<code class="computeroutput">%EDX</code> will exist in a host
1391
CPU register for more than one simulated x86 instruction,
1392
which seems to me to be a highly desirable property.</p>
1393
<p>There is some mucking around to do with subregisters;
1394
<code class="computeroutput">%AL</code> vs
1395
<code class="computeroutput">%AH</code>
1396
<code class="computeroutput">%AX</code> vs
1397
<code class="computeroutput">%EAX</code> etc. I can't remember
1398
how it works, but in general we are very conservative, and
1399
these tend to invalidate the caching.</p>
1402
<p>Redundant <code class="computeroutput">PUT</code>
1403
elimination. This annuls
1404
<code class="computeroutput">PUT</code>s of values back to
1405
simulated CPU registers if a later
1406
<code class="computeroutput">PUT</code> would overwrite the
1407
earlier <code class="computeroutput">PUT</code> value, and there
1408
is no intervening reads of the simulated register
1409
(<code class="computeroutput">ArchReg</code>).</p>
1410
<p>As before, we are paranoid when faced with subregister
1411
references. Also, <code class="computeroutput">PUT</code>s of
1412
<code class="computeroutput">%ESP</code> are never annulled,
1413
because it is vital the instrumenter always has an up-to-date
1414
<code class="computeroutput">%ESP</code> value available,
1415
<code class="computeroutput">%ESP</code> changes affect
1416
addressibility of the memory around the simulated stack
1418
<p>The implication of the above paragraph is that the
1419
simulated machine's registers are only lazily updated once
1420
the above two optimisation phases have run, with the
1421
exception of <code class="computeroutput">%ESP</code>.
1422
<code class="computeroutput">TempReg</code>s go dead at the end
1423
of every basic block, from which is is inferrable that any
1424
<code class="computeroutput">TempReg</code> caching a simulated
1425
CPU reg is flushed (back into the relevant
1426
<code class="computeroutput">VG_(baseBlock)</code> slot) at the
1427
end of every basic block. The further implication is that
1428
the simulated registers are only up-to-date at in between
1429
basic blocks, and not at arbitrary points inside basic
1430
blocks. And the consequence of that is that we can only
1431
deliver signals to the client in between basic blocks. None
1432
of this seems any problem in practice.</p>
1434
<li><p>Finally there is a simple def-use thing for condition
1435
codes. If an earlier uinstr writes the condition codes, and
1436
the next uinsn along which actually cares about the condition
1437
codes writes the same or larger set of them, but does not
1438
read any, the earlier uinsn is marked as not writing any
1439
condition codes. This saves a lot of redundant cond-code
1440
saving and restoring.</p></li>
1442
<p>The effect of these transformations on our short block is
1443
rather unexciting, and shown below. On longer basic blocks they
1444
can dramatically improve code quality.</p>
1445
<pre class="programlisting">
1446
at 3: delete GET, rename t2 to t0 in (4 .. 6)
1447
at 7: delete GET, rename t6 to t0 in (8 .. 9)
1448
at 1: annul flag write OSZAP due to later OSZACP
1458
9: LEA2L 1(t8,t0,2), t4
1461
12: ANDB t12, t10 (-wOSZACP)
1463
14: Jnzo $0x40435A50 (-rOSZACP)
1464
15: JMPo $0x40435A5B</pre>
1466
<div class="sect2" lang="en">
1467
<div class="titlepage"><div><div><h3 class="title">
1468
<a name="mc-tech-docs.instrum"></a>1.2.8.�UCode instrumentation</h3></div></div></div>
1469
<p>Once you understand the meaning of the instrumentation
1470
uinstrs, discussed in detail above, the instrumentation scheme is
1471
fairly straightforward. Each uinstr is instrumented in
1472
isolation, and the instrumentation uinstrs are placed before the
1473
original uinstr. Our running example continues below. I have
1474
placed a blank line after every original ucode, to make it easier
1475
to see which instrumentation uinstrs correspond to which
1477
<p>As mentioned somewhere above,
1478
<code class="computeroutput">TempReg</code>s carrying values have
1479
names like <code class="computeroutput">t28</code>, and each one has
1480
a shadow carrying its V bits, with names like
1481
<code class="computeroutput">q28</code>. This pairing aids in
1482
reading instrumented ucode.</p>
1483
<p>One decision about all this is where to have "observation
1484
points", that is, where to check that V bits are valid. I use a
1485
minimalistic scheme, only checking where a failure of validity
1486
could cause the original program to (seg)fault. So the use of
1487
values as memory addresses causes a check, as do conditional
1488
jumps (these cause a check on the definedness of the condition
1489
codes). And arguments <code class="computeroutput">PUSH</code>ed
1490
for helper calls are checked, hence the weird restrictions on
1491
help call preambles described above.</p>
1492
<p>Another decision is that once a value is tested, it is
1493
thereafter regarded as defined, so that we do not emit multiple
1494
undefined-value errors for the same undefined value. That means
1495
that <code class="computeroutput">TESTV</code> uinstrs are always
1496
followed by <code class="computeroutput">SETV</code> on the same
1497
(shadow) <code class="computeroutput">TempReg</code>s. Most of
1498
these <code class="computeroutput">SETV</code>s are redundant and
1499
are removed by the post-instrumentation cleanup phase.</p>
1500
<p>The instrumentation for calling helper functions deserves
1501
further comment. The definedness of results from a helper is
1502
modelled using just one V bit. So, in short, we do pessimising
1503
casts of the definedness of all the args, down to a single bit,
1504
and then <code class="computeroutput">UifU</code> these bits
1505
together. So this single V bit will say "undefined" if any part
1506
of any arg is undefined. This V bit is then pessimally cast back
1507
up to the result(s) sizes, as needed. If, by seeing that all the
1508
args are got rid of with <code class="computeroutput">CLEAR</code>
1509
and none with <code class="computeroutput">POP</code>, Valgrind sees
1510
that the result of the call is not actually used, it immediately
1511
examines the result V bit with a
1512
<code class="computeroutput">TESTV</code> --
1513
<code class="computeroutput">SETV</code> pair. If it did not do
1514
this, there would be no observation point to detect that the some
1515
of the args to the helper were undefined. Of course, if the
1516
helper's results are indeed used, we don't do this, since the
1517
result usage will presumably cause the result definedness to be
1518
checked at some suitable future point.</p>
1519
<p>In general Valgrind tries to track definedness on a
1520
bit-for-bit basis, but as the above para shows, for calls to
1521
helpers we throw in the towel and approximate down to a single
1522
bit. This is because it's too complex and difficult to track
1523
bit-level definedness through complex ops such as integer
1524
multiply and divide, and in any case there is no reasonable code
1525
fragments which attempt to (eg) multiply two partially-defined
1526
values and end up with something meaningful, so there seems
1527
little point in modelling multiplies, divides, etc, in that level
1529
<p>Integer loads and stores are instrumented with firstly a
1530
test of the definedness of the address, followed by a
1531
<code class="computeroutput">LOADV</code> or
1532
<code class="computeroutput">STOREV</code> respectively. These turn
1533
into calls to (for example)
1534
<code class="computeroutput">VG_(helperc_LOADV4)</code>. These
1535
helpers do two things: they perform an address-valid check, and
1536
they load or store V bits from/to the relevant address in the
1537
(simulated V-bit) memory.</p>
1538
<p>FPU loads and stores are different. As above the
1539
definedness of the address is first tested. However, the helper
1540
routine for FPU loads
1541
(<code class="computeroutput">VGM_(fpu_read_check)</code>) emits an
1542
error if either the address is invalid or the referenced area
1543
contains undefined values. It has to do this because we do not
1544
simulate the FPU at all, and so cannot track definedness of
1545
values loaded into it from memory, so we have to check them as
1546
soon as they are loaded into the FPU, ie, at this point. We
1547
notionally assume that everything in the FPU is defined.</p>
1548
<p>It follows therefore that FPU writes first check the
1549
definedness of the address, then the validity of the address, and
1550
finally mark the written bytes as well-defined.</p>
1551
<p>If anyone is inspired to extend Valgrind to MMX/SSE insns,
1552
I suggest you use the same trick. It works provided that the
1553
FPU/MMX unit is not used to merely as a conduit to copy partially
1554
undefined data from one place in memory to another.
1555
Unfortunately the integer CPU is used like that (when copying C
1556
structs with holes, for example) and this is the cause of much of
1557
the elaborateness of the instrumentation here described.</p>
1558
<p><code class="computeroutput">vg_instrument()</code> in
1559
<code class="filename">vg_translate.c</code> actually does the
1560
instrumentation. There are comments explaining how each uinstr
1561
is handled, so we do not repeat that here. As explained already,
1562
it is bit-accurate, except for calls to helper functions.
1563
Unfortunately the x86 insns
1564
<code class="computeroutput">bt/bts/btc/btr</code> are done by
1565
helper fns, so bit-level accuracy is lost there. This should be
1566
fixed by doing them inline; it will probably require adding a
1567
couple new uinstrs. Also, left and right rotates through the
1568
carry flag (x86 <code class="computeroutput">rcl</code> and
1569
<code class="computeroutput">rcr</code>) are approximated via a
1570
single V bit; so far this has not caused anyone to complain. The
1571
non-carry rotates, <code class="computeroutput">rol</code> and
1572
<code class="computeroutput">ror</code>, are much more common and
1573
are done exactly. Re-visiting the instrumentation for AND and
1574
OR, they seem rather verbose, and I wonder if it could be done
1575
more concisely now.</p>
1576
<p>The lowercase <code class="computeroutput">o</code> on many of
1577
the uopcodes in the running example indicates that the size field
1578
is zero, usually meaning a single-bit operation.</p>
1579
<p>Anyroads, the post-instrumented version of our running
1580
example looks like this:</p>
1581
<pre class="programlisting">
1586
2: TAG1o q0 = Left4 ( q0 )
1597
10: TAG1o q0 = SWiden14 ( q0 )
1608
18: TAG2o q4 = UifU4 ( q8, q4 )
1609
19: TAG1o q4 = Left4 ( q4 )
1610
20: LEA2L 1(t8,t0,2), t4
1614
23: LOADVB (t4), q10
1621
28: TAG2o q14 = ImproveAND1_TQ ( t10, q14 )
1622
29: TAG2o q10 = UifU1 ( q12, q10 )
1623
30: TAG2o q10 = DifD1 ( q14, q10 )
1625
32: TAG2o q14 = ImproveAND1_TQ ( t12, q14 )
1626
33: TAG2o q10 = DifD1 ( q14, q10 )
1628
35: TAG1o q16 = PCast10 ( q16 )
1630
37: ANDB t12, t10 (-wOSZACP)
1637
42: Jnzo $0x40435A50 (-rOSZACP)
1639
43: JMPo $0x40435A5B</pre>
1641
<div class="sect2" lang="en">
1642
<div class="titlepage"><div><div><h3 class="title">
1643
<a name="mc-tech-docs.cleanup"></a>1.2.9.�UCode post-instrumentation cleanup</h3></div></div></div>
1644
<p>This pass, coordinated by
1645
<code class="computeroutput">vg_cleanup()</code>, removes redundant
1646
definedness computation created by the simplistic instrumentation
1647
pass. It consists of two passes,
1648
<code class="computeroutput">vg_propagate_definedness()</code>
1650
<code class="computeroutput">vg_delete_redundant_SETVs</code>.</p>
1651
<p><code class="computeroutput">vg_propagate_definedness()</code>
1652
is a simple constant-propagation and constant-folding pass. It
1653
tries to determine which
1654
<code class="computeroutput">TempReg</code>s containing V bits will
1655
always indicate "fully defined", and it propagates this
1656
information as far as it can, and folds out as many operations as
1657
possible. For example, the instrumentation for an ADD of a
1658
literal to a variable quantity will be reduced down so that the
1659
definedness of the result is simply the definedness of the
1660
variable quantity, since the literal is by definition fully
1662
<p><code class="computeroutput">vg_delete_redundant_SETVs</code>
1663
removes <code class="computeroutput">SETV</code>s on shadow
1664
<code class="computeroutput">TempReg</code>s for which the next
1665
action is a write. I don't think there's anything else worth
1666
saying about this; it is simple. Read the sources for
1668
<p>So the cleaned-up running example looks like this. As
1669
above, I have inserted line breaks after every original
1670
(non-instrumentation) uinstr to aid readability. As with
1671
straightforward ucode optimisation, the results in this block are
1672
undramatic because it is so short; longer blocks benefit more
1673
because they have more redundancy which gets eliminated.</p>
1674
<pre class="programlisting">
1675
at 29: delete UifU1 due to defd arg1
1676
at 32: change ImproveAND1_TQ to MOV due to defd arg2
1686
2: TAG1o q0 = Left4 ( q0 )
1696
10: TAG1o q0 = SWiden14 ( q0 )
1707
18: TAG2o q4 = UifU4 ( q8, q4 )
1708
19: TAG1o q4 = Left4 ( q4 )
1709
20: LEA2L 1(t8,t0,2), t4
1712
23: LOADVB (t4), q10
1718
28: TAG2o q14 = ImproveAND1_TQ ( t10, q14 )
1719
30: TAG2o q10 = DifD1 ( q14, q10 )
1721
33: TAG2o q10 = DifD1 ( q14, q10 )
1723
35: TAG1o q16 = PCast10 ( q16 )
1725
37: ANDB t12, t10 (-wOSZACP)
1730
42: Jnzo $0x40435A50 (-rOSZACP)
1732
43: JMPo $0x40435A5B</pre>
1734
<div class="sect2" lang="en">
1735
<div class="titlepage"><div><div><h3 class="title">
1736
<a name="mc-tech-docs.transfrom"></a>1.2.10.�Translation from UCode</h3></div></div></div>
1737
<p>This is all very simple, even though
1738
<code class="filename">vg_from_ucode.c</code> is a big file.
1739
Position-independent x86 code is generated into a dynamically
1740
allocated array <code class="computeroutput">emitted_code</code>;
1741
this is doubled in size when it overflows. Eventually the array
1742
is handed back to the caller of
1743
<code class="computeroutput">VG_(translate)</code>, who must copy
1744
the result into TC and TT, and free the array.</p>
1745
<p>This file is structured into four layers of abstraction,
1746
which, thankfully, are glued back together with extensive
1747
<code class="computeroutput">__inline__</code> directives. From the
1749
<div class="itemizedlist"><ul type="disc">
1750
<li><p>Address-mode emitters,
1751
<code class="computeroutput">emit_amode_regmem_reg</code> et
1753
<li><p>Emitters for specific x86 instructions. There are
1754
quite a lot of these, with names such as
1755
<code class="computeroutput">emit_movv_offregmem_reg</code>.
1756
The <code class="computeroutput">v</code> suffix is Intel
1757
parlance for a 16/32 bit insn; there are also
1758
<code class="computeroutput">b</code> suffixes for 8 bit
1760
<li><p>The next level up are the
1761
<code class="computeroutput">synth_*</code> functions, which
1762
synthesise possibly a sequence of raw x86 instructions to do
1763
some simple task. Some of these are quite complex because
1764
they have to work around Intel's silly restrictions on
1765
subregister naming. See
1766
<code class="computeroutput">synth_nonshiftop_reg_reg</code> for
1768
<li><p>Finally, at the top of the heap, we have
1769
<code class="computeroutput">emitUInstr()</code>, which emits
1770
code for a single uinstr.</p></li>
1772
<p>Some comments:</p>
1773
<div class="itemizedlist"><ul type="disc">
1775
<p>The hack for FPU instructions becomes apparent here.
1776
To do a <code class="computeroutput">FPU</code> ucode
1777
instruction, we load the simulated FPU's state into from its
1778
<code class="computeroutput">VG_(baseBlock)</code> into the real
1779
FPU using an x86 <code class="computeroutput">frstor</code>
1780
insn, do the ucode <code class="computeroutput">FPU</code> insn
1781
on the real CPU, and write the updated FPU state back into
1782
<code class="computeroutput">VG_(baseBlock)</code> using an
1783
<code class="computeroutput">fnsave</code> instruction. This is
1784
pretty brutal, but is simple and it works, and even seems
1785
tolerably efficient. There is no attempt to cache the
1786
simulated FPU state in the real FPU over multiple
1787
back-to-back ucode FPU instructions.</p>
1788
<p><code class="computeroutput">FPU_R</code> and
1789
<code class="computeroutput">FPU_W</code> are also done this
1790
way, with the minor complication that we need to patch in
1791
some addressing mode bits so the resulting insn knows the
1792
effective address to use. This is easy because of the
1793
regularity of the x86 FPU instruction encodings.</p>
1795
<li><p>An analogous trick is done with ucode insns which
1796
claim, in their <code class="computeroutput">flags_r</code> and
1797
<code class="computeroutput">flags_w</code> fields, that they
1798
read or write the simulated
1799
<code class="computeroutput">%EFLAGS</code>. For such cases we
1800
first copy the simulated
1801
<code class="computeroutput">%EFLAGS</code> into the real
1802
<code class="computeroutput">%eflags</code>, then do the insn,
1803
then, if the insn says it writes the flags, copy back to
1804
<code class="computeroutput">%EFLAGS</code>. This is a bit
1805
expensive, which is why the ucode optimisation pass goes to
1806
some effort to remove redundant flag-update annotations.</p></li>
1808
<p>And so ... that's the end of the documentation for the
1809
instrumentating translator! It's really not that complex,
1810
because it's composed as a sequence of simple(ish) self-contained
1811
transformations on straight-line blocks of code.</p>
1813
<div class="sect2" lang="en">
1814
<div class="titlepage"><div><div><h3 class="title">
1815
<a name="mc-tech-docs.dispatch"></a>1.2.11.�Top-level dispatch loop</h3></div></div></div>
1816
<p>Urk. In <code class="computeroutput">VG_(toploop)</code>.
1817
This is basically boring and unsurprising, not to mention fiddly
1818
and fragile. It needs to be cleaned up.</p>
1819
<p>The only perhaps surprise is that the whole thing is run on
1820
top of a <code class="computeroutput">setjmp</code>-installed
1821
exception handler, because, supposing a translation got a
1822
segfault, we have to bail out of the Valgrind-supplied exception
1823
handler <code class="computeroutput">VG_(oursignalhandler)</code>
1824
and immediately start running the client's segfault handler, if
1825
it has one. In particular we can't finish the current basic
1826
block and then deliver the signal at some convenient future
1827
point, because signals like SIGILL, SIGSEGV and SIGBUS mean that
1828
the faulting insn should not simply be re-tried. (I'm sure there
1829
is a clearer way to explain this).</p>
1831
<div class="sect2" lang="en">
1832
<div class="titlepage"><div><div><h3 class="title">
1833
<a name="mc-tech-docs.lazy"></a>1.2.12.�Lazy updates of the simulated program counter</h3></div></div></div>
1834
<p>Simulated <code class="computeroutput">%EIP</code> is not
1835
updated after every simulated x86 insn as this was regarded as
1836
too expensive. Instead ucode
1837
<code class="computeroutput">INCEIP</code> insns move it along as
1838
and when necessary. Currently we don't allow it to fall more
1839
than 4 bytes behind reality (see
1840
<code class="computeroutput">VG_(disBB)</code> for the way this
1842
<p>Note that <code class="computeroutput">%EIP</code> is always
1843
brought up to date by the inner dispatch loop in
1844
<code class="computeroutput">VG_(dispatch)</code>, so that if the
1845
client takes a fault we know at least which basic block this
1848
<div class="sect2" lang="en">
1849
<div class="titlepage"><div><div><h3 class="title">
1850
<a name="mc-tech-docs.signals"></a>1.2.13.�Signals</h3></div></div></div>
1851
<p>Horrible, horrible. <code class="filename">vg_signals.c</code>.
1852
Basically, since we have to intercept all system calls anyway, we
1853
can see when the client tries to install a signal handler. If it
1854
does so, we make a note of what the client asked to happen, and
1855
ask the kernel to route the signal to our own signal handler,
1856
<code class="computeroutput">VG_(oursignalhandler)</code>. This
1857
simply notes the delivery of signals, and returns.</p>
1858
<p>Every 1000 basic blocks, we see if more signals have
1860
<code class="computeroutput">VG_(deliver_signals)</code> builds
1861
signal delivery frames on the client's stack, and allows their
1862
handlers to be run. Valgrind places in these signal delivery
1863
frames a bogus return address,
1864
<code class="computeroutput">VG_(signalreturn_bogusRA)</code>, and
1865
checks all jumps to see if any jump to it. If so, this is a sign
1866
that a signal handler is returning, and if so Valgrind removes
1867
the relevant signal frame from the client's stack, restores the
1868
from the signal frame the simulated state before the signal was
1869
delivered, and allows the client to run onwards. We have to do
1870
it this way because some signal handlers never return, they just
1871
<code class="computeroutput">longjmp()</code>, which nukes the
1872
signal delivery frame.</p>
1873
<p>The Linux kernel has a different but equally horrible hack
1874
for detecting signal handler returns. Discovering it is left as
1875
an exercise for the reader.</p>
1877
<div class="sect2" lang="en">
1878
<div class="titlepage"><div><div><h3 class="title">
1879
<a name="mc-tech-docs.todo"></a>1.2.14.�To be written</h3></div></div></div>
1880
<p>The following is a list of as-yet-not-written stuff. Apologies.</p>
1881
<div class="orderedlist"><ol type="1">
1882
<li><p>The translation cache and translation table</p></li>
1883
<li><p>Exceptions, creating new translations</p></li>
1884
<li><p>Self-modifying code</p></li>
1885
<li><p>Errors, error contexts, error reporting, suppressions</p></li>
1886
<li><p>Client malloc/free</p></li>
1887
<li><p>Low-level memory management</p></li>
1888
<li><p>A and V bitmaps</p></li>
1889
<li><p>Symbol table management</p></li>
1890
<li><p>Dealing with system calls</p></li>
1891
<li><p>Namespace management</p></li>
1892
<li><p>GDB attaching</p></li>
1893
<li><p>Non-dependence on glibc or anything else</p></li>
1894
<li><p>The leak detector</p></li>
1895
<li><p>Performance problems</p></li>
1896
<li><p>Continuous sanity checking</p></li>
1897
<li><p>Tracing, or not tracing, child processes</p></li>
1898
<li><p>Assembly glue for syscalls</p></li>
1902
<div class="sect1" lang="en">
1903
<div class="titlepage"><div><div><h2 class="title" style="clear: both">
1904
<a name="mc-tech-docs.extensions"></a>1.3.�Extensions</h2></div></div></div>
1905
<p>Some comments about Stuff To Do.</p>
1906
<div class="sect2" lang="en">
1907
<div class="titlepage"><div><div><h3 class="title">
1908
<a name="mc-tech-docs.bugs"></a>1.3.1.�Bugs</h3></div></div></div>
1909
<p>Stephan Kulow and Marc Mutz report problems with kmail in
1910
KDE 3 CVS (RC2 ish) when run on Valgrind. Stephan has it
1911
deadlocking; Marc has it looping at startup. I can't repro
1912
either behaviour. Needs repro-ing and fixing.</p>
1914
<div class="sect2" lang="en">
1915
<div class="titlepage"><div><div><h3 class="title">
1916
<a name="mc-tech-docs.threads"></a>1.3.2.�Threads</h3></div></div></div>
1917
<p>Doing a good job of thread support strikes me as almost a
1918
research-level problem. The central issues are how to do fast
1919
cheap locking of the
1920
<code class="computeroutput">VG_(primary_map)</code> structure,
1921
whether or not accesses to the individual secondary maps need
1922
locking, what race-condition issues result, and whether the
1923
already-nasty mess that is the signal simulator needs further
1925
<p>I realise that threads are the most-frequently-requested
1926
feature, and I am thinking about it all. If you have guru-level
1927
understanding of fast mutual exclusion mechanisms and race
1928
conditions, I would be interested in hearing from you.</p>
1930
<div class="sect2" lang="en">
1931
<div class="titlepage"><div><div><h3 class="title">
1932
<a name="mc-tech-docs.verify"></a>1.3.3.�Verification suite</h3></div></div></div>
1933
<p>Directory <code class="computeroutput">tests/</code> contains
1934
various ad-hoc tests for Valgrind. However, there is no
1935
systematic verification or regression suite, that, for example,
1936
exercises all the stuff in <code class="filename">vg_memory.c</code>, to
1937
ensure that illegal memory accesses and undefined value uses are
1938
detected as they should be. It would be good to have such a
1941
<div class="sect2" lang="en">
1942
<div class="titlepage"><div><div><h3 class="title">
1943
<a name="mc-tech-docs.porting"></a>1.3.4.�Porting to other platforms</h3></div></div></div>
1944
<p>It would be great if Valgrind was ported to FreeBSD and x86
1945
NetBSD, and to x86 OpenBSD, if it's possible (doesn't OpenBSD use
1946
a.out-style executables, not ELF ?)</p>
1947
<p>The main difficulties, for an x86-ELF platform, seem to
1949
<div class="itemizedlist"><ul type="disc">
1950
<li><p>You'd need to rewrite the
1951
<code class="computeroutput">/proc/self/maps</code> parser
1952
(<code class="filename">vg_procselfmaps.c</code>). Easy.</p></li>
1953
<li><p>You'd need to rewrite
1954
<code class="filename">vg_syscall_mem.c</code>, or, more specifically,
1955
provide one for your OS. This is tedious, but you can
1956
implement syscalls on demand, and the Linux kernel interface
1957
is, for the most part, going to look very similar to the *BSD
1958
interfaces, so it's really a copy-paste-and-modify-on-demand
1959
job. As part of this, you'd need to supply a new
1960
<code class="filename">vg_kerneliface.h</code> file.</p></li>
1961
<li><p>You'd also need to change the syscall wrappers for
1962
Valgrind's internal use, in
1963
<code class="filename">vg_mylibc.c</code>.</p></li>
1965
<p>All in all, I think a port to x86-ELF *BSDs is not really
1966
very difficult, and in some ways I would like to see it happen,
1967
because that would force a more clear factoring of Valgrind into
1968
platform dependent and independent pieces. Not to mention, *BSD
1969
folks also deserve to use Valgrind just as much as the Linux crew
1973
<div class="sect1" lang="en">
1974
<div class="titlepage"><div><div><h2 class="title" style="clear: both">
1975
<a name="mc-tech-docs.easystuff"></a>1.4.�Easy stuff which ought to be done</h2></div></div></div>
1976
<div class="sect2" lang="en">
1977
<div class="titlepage"><div><div><h3 class="title">
1978
<a name="mc-tech-docs.mmx"></a>1.4.1.�MMX Instructions</h3></div></div></div>
1979
<p>MMX insns should be supported, using the same trick as for
1980
FPU insns. If the MMX registers are not used to copy
1981
uninitialised junk from one place to another in memory, this
1982
means we don't have to actually simulate the internal MMX unit
1983
state, so the FPU hack applies. This should be fairly
1986
<div class="sect2" lang="en">
1987
<div class="titlepage"><div><div><h3 class="title">
1988
<a name="mc-tech-docs.fixstabs"></a>1.4.2.�Fix stabs-info reader</h3></div></div></div>
1989
<p>The machinery in <code class="filename">vg_symtab2.c</code> which
1990
reads "stabs" style debugging info is pretty weak. It usually
1991
correctly translates simulated program counter values into line
1992
numbers and procedure names, but the file name is often
1993
completely wrong. I think the logic used to parse "stabs"
1994
entries is weak. It should be fixed. The simplest solution,
1995
IMO, is to copy either the logic or simply the code out of GNU
1996
binutils which does this; since GDB can clearly get it right,
1997
binutils (or GDB?) must have code to do this somewhere.</p>
1999
<div class="sect2" lang="en">
2000
<div class="titlepage"><div><div><h3 class="title">
2001
<a name="mc-tech-docs.x86instr"></a>1.4.3.�BT/BTC/BTS/BTR</h3></div></div></div>
2002
<p>These are x86 instructions which test, complement, set, or
2003
reset, a single bit in a word. At the moment they are both
2004
incorrectly implemented and incorrectly instrumented.</p>
2005
<p>The incorrect instrumentation is due to use of helper
2006
functions. This means we lose bit-level definedness tracking,
2007
which could wind up giving spurious uninitialised-value use
2008
errors. The Right Thing to do is to invent a couple of new
2009
UOpcodes, I think <code class="computeroutput">GET_BIT</code> and
2010
<code class="computeroutput">SET_BIT</code>, which can be used to
2011
implement all 4 x86 insns, get rid of the helpers, and give
2012
bit-accurate instrumentation rules for the two new
2014
<p>I realised the other day that they are mis-implemented too.
2015
The x86 insns take a bit-index and a register or memory location
2016
to access. For registers the bit index clearly can only be in
2017
the range zero to register-width minus 1, and I assumed the same
2018
applied to memory locations too. But evidently not; for memory
2019
locations the index can be arbitrary, and the processor will
2020
index arbitrarily into memory as a result. This too should be
2021
fixed. Sigh. Presumably indexing outside the immediate word is
2022
not actually used by any programs yet tested on Valgrind, for
2023
otherwise they (presumably) would simply not work at all. If you
2024
plan to hack on this, first check the Intel docs to make sure my
2025
understanding is really correct.</p>
2027
<div class="sect2" lang="en">
2028
<div class="titlepage"><div><div><h3 class="title">
2029
<a name="mc-tech-docs.prefetch"></a>1.4.4.�Using PREFETCH Instructions</h3></div></div></div>
2030
<p>Here's a small but potentially interesting project for
2031
performance junkies. Experiments with valgrind's code generator
2032
and optimiser(s) suggest that reducing the number of instructions
2033
executed in the translations and mem-check helpers gives
2034
disappointingly small performance improvements. Perhaps this is
2035
because performance of Valgrindified code is limited by cache
2036
misses. After all, each read in the original program now gives
2037
rise to at least three reads, one for the
2038
<code class="computeroutput">VG_(primary_map)</code>, one of the
2039
resulting secondary, and the original. Not to mention, the
2040
instrumented translations are 13 to 14 times larger than the
2041
originals. All in all one would expect the memory system to be
2042
hammered to hell and then some.</p>
2043
<p>So here's an idea. An x86 insn involving a read from
2044
memory, after instrumentation, will turn into ucode of the
2046
<pre class="programlisting">
2047
... calculate effective addr, into ta and qa ...
2048
TESTVL qa -- is the addr defined?
2049
LOADV (ta), qloaded -- fetch V bits for the addr
2050
LOAD (ta), tloaded -- do the original load</pre>
2051
<p>At the point where the
2052
<code class="computeroutput">LOADV</code> is done, we know the
2053
actual address (<code class="computeroutput">ta</code>) from which
2054
the real <code class="computeroutput">LOAD</code> will be done. We
2055
also know that the <code class="computeroutput">LOADV</code> will
2056
take around 20 x86 insns to do. So it seems plausible that doing
2057
a prefetch of <code class="computeroutput">ta</code> just before the
2058
<code class="computeroutput">LOADV</code> might just avoid a miss at
2059
the <code class="computeroutput">LOAD</code> point, and that might
2060
be a significant performance win.</p>
2061
<p>Prefetch insns are notoriously tempermental, more often
2062
than not making things worse rather than better, so this would
2063
require considerable fiddling around. It's complicated because
2064
Intels and AMDs have different prefetch insns with different
2065
semantics, so that too needs to be taken into account. As a
2066
general rule, even placing the prefetches before the
2067
<code class="computeroutput">LOADV</code> insn is too near the
2068
<code class="computeroutput">LOAD</code>; the ideal distance is
2069
apparently circa 200 CPU cycles. So it might be worth having
2070
another analysis/transformation pass which pushes prefetches as
2071
far back as possible, hopefully immediately after the effective
2072
address becomes available.</p>
2073
<p>Doing too many prefetches is also bad because they soak up
2074
bus bandwidth / cpu resources, so some cleverness in deciding
2075
which loads to prefetch and which to not might be helpful. One
2076
can imagine not prefetching client-stack-relative
2077
(<code class="computeroutput">%EBP</code> or
2078
<code class="computeroutput">%ESP</code>) accesses, since the stack
2079
in general tends to show good locality anyway.</p>
2080
<p>There's quite a lot of experimentation to do here, but I
2081
think it might make an interesting week's work for
2083
<p>As of 15-ish March 2002, I've started to experiment with
2085
<code class="computeroutput">prefetch/prefetchw</code> insns.</p>
2087
<div class="sect2" lang="en">
2088
<div class="titlepage"><div><div><h3 class="title">
2089
<a name="mc-tech-docs.pranges"></a>1.4.5.�User-defined Permission Ranges</h3></div></div></div>
2090
<p>This is quite a large project -- perhaps a month's hacking
2091
for a capable hacker to do a good job -- but it's potentially
2092
very interesting. The outcome would be that Valgrind could
2093
detect a whole class of bugs which it currently cannot.</p>
2094
<p>The presentation falls into two pieces.</p>
2095
<div class="sect3" lang="en">
2096
<div class="titlepage"><div><div><h4 class="title">
2097
<a name="mc-tech-docs.psetting"></a>1.4.5.1.�Part 1: User-defined Address-range Permission Setting</h4></div></div></div>
2098
<p>Valgrind intercepts the client's
2099
<code class="computeroutput">malloc</code>,
2100
<code class="computeroutput">free</code>, etc calls, watches system
2101
calls, and watches the stack pointer move. This is currently the
2102
only way it knows about which addresses are valid and which not.
2103
Sometimes the client program knows extra information about its
2104
memory areas. For example, the client could at some point know
2105
that all elements of an array are out-of-date. We would like to
2106
be able to convey to Valgrind this information that the array is
2107
now addressable-but-uninitialised, so that Valgrind can then warn
2108
if elements are used before they get new values.</p>
2109
<p>What I would like are some macros like this:</p>
2110
<pre class="programlisting">
2111
VALGRIND_MAKE_NOACCESS(addr, len)
2112
VALGRIND_MAKE_WRITABLE(addr, len)
2113
VALGRIND_MAKE_READABLE(addr, len)</pre>
2114
<p>and also, to check that memory is
2115
addressible/initialised,</p>
2116
<pre class="programlisting">
2117
VALGRIND_CHECK_ADDRESSIBLE(addr, len)
2118
VALGRIND_CHECK_INITIALISED(addr, len)</pre>
2119
<p>I then include in my sources a header defining these
2120
macros, rebuild my app, run under Valgrind, and get user-defined
2122
<p>Now here's a neat trick. It's a nuisance to have to
2123
re-link the app with some new library which implements the above
2124
macros. So the idea is to define the macros so that the
2125
resulting executable is still completely stand-alone, and can be
2126
run without Valgrind, in which case the macros do nothing, but
2127
when run on Valgrind, the Right Thing happens. How to do this?
2128
The idea is for these macros to turn into a piece of inline
2129
assembly code, which (1) has no effect when run on the real CPU,
2130
(2) is easily spotted by Valgrind's JITter, and (3) no sane
2131
person would ever write, which is important for avoiding false
2132
matches in (2). So here's a suggestion:</p>
2133
<pre class="programlisting">
2134
VALGRIND_MAKE_NOACCESS(addr, len)</pre>
2135
<p>becomes (roughly speaking)</p>
2136
<pre class="programlisting">
2139
movl $1, %ecx -- 1 describes the action; MAKE_WRITABLE might be
2144
rorl $21, %eax</pre>
2145
<p>The rotate sequences have no effect, and it's unlikely they
2146
would appear for any other reason, but they define a unique
2147
byte-sequence which the JITter can easily spot. Using the
2148
operand constraints section at the end of a gcc inline-assembly
2149
statement, we can tell gcc that the assembly fragment kills
2150
<code class="computeroutput">%eax</code>,
2151
<code class="computeroutput">%ebx</code>,
2152
<code class="computeroutput">%ecx</code> and the condition codes, so
2153
this fragment is made harmless when not running on Valgrind, runs
2154
quickly when not on Valgrind, and does not require any other
2155
library support.</p>
2157
<div class="sect3" lang="en">
2158
<div class="titlepage"><div><div><h4 class="title">
2159
<a name="mc-tech-docs.prange-detect"></a>1.4.5.2.�Part 2: Using it to detect Interference between Stack
2160
Variables</h4></div></div></div>
2161
<p>Currently Valgrind cannot detect errors of the following
2163
<pre class="programlisting">
2164
void fooble ( void )
2170
<p>Now imagine rewriting this as</p>
2171
<pre class="programlisting">
2172
void fooble ( void )
2179
VALGRIND_MAKE_NOACCESS(&spacer0, sizeof(int));
2180
VALGRIND_MAKE_NOACCESS(&spacer1, sizeof(int));
2181
VALGRIND_MAKE_NOACCESS(&spacer2, sizeof(int));
2184
<p>Now the invalid write is certain to hit
2185
<code class="computeroutput">spacer0</code> or
2186
<code class="computeroutput">spacer1</code>, so Valgrind will spot
2188
<p>There are two complications.</p>
2189
<div class="orderedlist"><ol type="1">
2190
<li><p>The first is that we don't want to annotate sources by
2191
hand, so the Right Thing to do is to write a C/C++ parser,
2192
annotator, prettyprinter which does this automatically, and
2193
run it on post-CPP'd C/C++ source. The parser/prettyprinter
2194
is probably not as hard as it sounds; I would write it in Haskell,
2195
a powerful functional language well suited to doing symbolic
2196
computation, with which I am intimately familar. There is
2197
already a C parser written in Haskell by someone in the
2198
Haskell community, and that would probably be a good starting
2201
<p>The second complication is how to get rid of these
2202
<code class="computeroutput">NOACCESS</code> records inside
2203
Valgrind when the instrumented function exits; after all,
2204
these refer to stack addresses and will make no sense
2205
whatever when some other function happens to re-use the same
2206
stack address range, probably shortly afterwards. I think I
2207
would be inclined to define a special stack-specific
2209
<pre class="programlisting">
2210
VALGRIND_MAKE_NOACCESS_STACK(addr, len)</pre>
2211
<p>which causes Valgrind to record the client's
2212
<code class="computeroutput">%ESP</code> at the time it is
2213
executed. Valgrind will then watch for changes in
2214
<code class="computeroutput">%ESP</code> and discard such
2215
records as soon as the protected area is uncovered by an
2216
increase in <code class="computeroutput">%ESP</code>. I
2217
hesitate with this scheme only because it is potentially
2218
expensive, if there are hundreds of such records, and
2219
considering that changes in
2220
<code class="computeroutput">%ESP</code> already require
2221
expensive messing with stack access permissions.</p>
2224
<p>This is probably easier and more robust than for the
2225
instrumenter program to try and spot all exit points for the
2226
procedure and place suitable deallocation annotations there.
2227
Plus C++ procedures can bomb out at any point if they get an
2228
exception, so spotting return points at the source level just
2229
won't work at all.</p>
2230
<p>Although some work, it's all eminently doable, and it would
2231
make Valgrind into an even-more-useful tool.</p>
2237
<br><table class="nav" width="100%" cellspacing="3" cellpadding="2" border="0" summary="Navigation footer">
2239
<td rowspan="2" width="40%" align="left">
2240
<a accesskey="p" href="tech-docs.html"><<�Valgrind Technical Documentation</a>�</td>
2241
<td width="20%" align="center"><a accesskey="u" href="tech-docs.html">Up</a></td>
2242
<td rowspan="2" width="40%" align="right">�<a accesskey="n" href="cg-tech-docs.html">2.�How Cachegrind works�>></a>
2245
<tr><td width="20%" align="center"><a accesskey="h" href="index.html">Home</a></td></tr>