~bkerensa/ubuntu/raring/valgrind/merge-from-deb

<dt><span class="sect2"><a href="mc-tech-docs.html#mc-tech-docs.tags">1.2.4. UCode operand tags: type <code class="computeroutput">Tag</code></a></span></dt>

<dt><span class="sect2"><a href="mc-tech-docs.html#mc-tech-docs.uinstr">1.2.5. UCode instructions: type <code class="computeroutput">UInstr</code></a></span></dt>

<dt><span class="sect2"><a href="mc-tech-docs.html#mc-tech-docs.trans">1.2.6. Translation into UCode</a></span></dt>

<dt><span class="sect2"><a href="mc-tech-docs.html#mc-tech-docs.optim">1.2.7. UCode optimisation</a></span></dt>

<dt><span class="sect2"><a href="mc-tech-docs.html#mc-tech-docs.instrum">1.2.8. UCode instrumentation</a></span></dt>

<dt><span class="sect2"><a href="mc-tech-docs.html#mc-tech-docs.cleanup">1.2.9. UCode post-instrumentation cleanup</a></span></dt>

<dt><span class="sect2"><a href="mc-tech-docs.html#mc-tech-docs.transfrom">1.2.10. Translation from UCode</a></span></dt>

<dt><span class="sect2"><a href="mc-tech-docs.html#mc-tech-docs.dispatch">1.2.11. Top-level dispatch loop</a></span></dt>

<dt><span class="sect2"><a href="mc-tech-docs.html#mc-tech-docs.lazy">1.2.12. Lazy updates of the simulated program counter</a></span></dt>

<dt><span class="sect2"><a href="mc-tech-docs.html#mc-tech-docs.signals">1.2.13. Signals</a></span></dt>

<dt><span class="sect2"><a href="mc-tech-docs.html#mc-tech-docs.todo">1.2.14. To be written</a></span></dt>

</dl></dd>

<dt><span class="sect1"><a href="mc-tech-docs.html#mc-tech-docs.extensions">1.3. Extensions</a></span></dt>

<dt><span class="sect2"><a href="mc-tech-docs.html#mc-tech-docs.threads">1.3.2. Threads</a></span></dt>

<dt><span class="sect2"><a href="mc-tech-docs.html#mc-tech-docs.verify">1.3.3. Verification suite</a></span></dt>

<dt><span class="sect2"><a href="mc-tech-docs.html#mc-tech-docs.porting">1.3.4. Porting to other platforms</a></span></dt>

</dl></dd>

<dt><span class="sect1"><a href="mc-tech-docs.html#mc-tech-docs.easystuff">1.4. Easy stuff which ought to be done</a></span></dt>

<dt><span class="sect2"><a href="mc-tech-docs.html#mc-tech-docs.mmx">1.4.1. MMX Instructions</a></span></dt>

<dt><span class="sect2"><a href="mc-tech-docs.html#mc-tech-docs.fixstabs">1.4.2. Fix stabs-info reader</a></span></dt>

<dt><span class="sect2"><a href="mc-tech-docs.html#mc-tech-docs.prefetch">1.4.4. Using PREFETCH Instructions</a></span></dt>

<dt><span class="sect2"><a href="mc-tech-docs.html#mc-tech-docs.pranges">1.4.5. User-defined Permission Ranges</a></span></dt>

</dl></dd>

</dl>

</div>

<a name="mc-tech-docs.intro"></a>1.1.�Introduction</h2></div></div></div>

<p>This document contains a detailed, highly-technical description of

the internals of Valgrind. This is not the user manual; if you are an

end-user of Valgrind, you do not want to read this. Conversely, if you

really are a hacker-type and want to know how it works, I assume that

you have read the user manual thoroughly.</p>

<p>You may need to read this document several times, and carefully.

Some important things, I only say once.</p>

<p>[Note: this document is now very old, and a lot of its contents

are out of date, and misleading.]</p>

<a name="mc-tech-docs.history"></a>1.1.1.�History</h3></div></div></div>

<p>Valgrind came into public view in late Feb 2002. However, it has

been under contemplation for a very long time, perhaps seriously for

about five years. Somewhat over two years ago, I started working on the

x86 code generator for the Glasgow Haskell Compiler

(http://www.haskell.org/ghc), gaining familiarity with x86 internals on

the way. I then did Cacheprof, gaining further x86 experience. Some

time around Feb 2000 I started experimenting with a user-space x86

interpreter for x86-Linux. This worked, but it was clear that a

JIT-based scheme would be necessary to give reasonable performance for

Valgrind. Design work for the JITter started in earnest in Oct 2000,

and by early 2001 I had an x86-to-x86 dynamic translator which could run

quite large programs. This translator was in a sense pointless, since

it did not do any instrumentation or checking.</p>

100

<p>Most of the rest of 2001 was taken up designing and implementing

101

the instrumentation scheme. The main difficulty, which consumed a lot

102

of effort, was to design a scheme which did not generate large numbers

103

of false uninitialised-value warnings. By late 2001 a satisfactory

104

scheme had been arrived at, and I started to test it on ever-larger

105

programs, with an eventual eye to making it work well enough so that it

106

was helpful to folks debugging the upcoming version 3 of KDE. I've used

107

KDE since before version 1.0, and wanted to Valgrind to be an indirect

108

contribution to the KDE 3 development effort. At the start of Feb 02

109

the kde-core-devel crew started using it, and gave a huge amount of

110

helpful feedback and patches in the space of three weeks. Snapshot

111

20020306 is the result.</p>

112

<p>In the best Unix tradition, or perhaps in the spirit of Fred

113

Brooks' depressing-but-completely-accurate epitaph "build one to throw

114

away; you will anyway", much of Valgrind is a second or third rendition

115

of the initial idea. The instrumentation machinery

116

(<code class="filename">vg_translate.c</code>, <code class="filename">vg_memory.c</code>)

117

and core CPU simulation (<code class="filename">vg_to_ucode.c</code>,

118

<code class="filename">vg_from_ucode.c</code>) have had three redesigns and

119

rewrites; the register allocator, low-level memory manager

120

(<code class="filename">vg_malloc2.c</code>) and symbol table reader

121

(<code class="filename">vg_symtab2.c</code>) are on the second rewrite. In a

122

sense, this document serves to record some of the knowledge gained as a

123

result.</p>

124

</div>

125

126

127

<a name="mc-tech-docs.overview"></a>1.1.2.�Design overview</h3></div></div></div>

128

<p>Valgrind is compiled into a Linux shared object,

129

<code class="filename">valgrind.so</code>, and also a dummy one,

130

<code class="filename">valgrinq.so</code>, of which more later. The

131

<code class="filename">valgrind</code> shell script adds

132

<code class="filename">valgrind.so</code> to the

133

<code class="computeroutput">LD_PRELOAD</code> list of extra libraries to

134

be loaded with any dynamically linked library. This is a standard

135

trick, one which I assume the

136

<code class="computeroutput">LD_PRELOAD</code> mechanism was developed to

137

support.</p>

138

<p><code class="filename">valgrind.so</code> is linked with the

139

<code class="computeroutput">-z initfirst</code> flag, which

140

requests that its initialisation code is run before that of any

141

other object in the executable image. When this happens,

142

valgrind gains control. The real CPU becomes "trapped" in

143

<code class="filename">valgrind.so</code> and the translations it

144

generates. The synthetic CPU provided by Valgrind does, however,

145

return from this initialisation function. So the normal startup

146

actions, orchestrated by the dynamic linker

147

<code class="filename">ld.so</code>, continue as usual, except on the

148

synthetic CPU, not the real one. Eventually

149

<code class="function">main</code> is run and returns, and

150

then the finalisation code of the shared objects is run,

151

presumably in inverse order to which they were initialised.

152

Remember, this is still all happening on the simulated CPU.

153

Eventually <code class="filename">valgrind.so</code>'s own finalisation

154

code is called. It spots this event, shuts down the simulated

155

CPU, prints any error summaries and/or does leak detection, and

156

returns from the initialisation code on the real CPU. At this

157

point, in effect the real and synthetic CPUs have merged back

158

into one, Valgrind has lost control of the program, and the

159

program finally <code class="function">exit()s</code> back to

160

the kernel in the usual way.</p>

161

<p>The normal course of activity, once Valgrind has started

162

up, is as follows. Valgrind never runs any part of your program

163

(usually referred to as the "client"), not a single byte of it,

164

directly. Instead it uses function

165

<code class="function">VG_(translate)</code> to translate

166

basic blocks (BBs, straight-line sequences of code) into

167

instrumented translations, and those are run instead. The

168

translations are stored in the translation cache (TC),

169

<code class="computeroutput">vg_tc</code>, with the translation

170

table (TT), <code class="computeroutput">vg_tt</code> supplying the

171

original-to-translation code address mapping. Auxiliary array

172

<code class="computeroutput">VG_(tt_fast)</code> is used as a

173

direct-map cache for fast lookups in TT; it usually achieves a

174

hit rate of around 98% and facilitates an orig-to-trans lookup in

175

4 x86 insns, which is not bad.</p>

176

<p>Function <code class="function">VG_(dispatch)</code> in

177

<code class="filename">vg_dispatch.S</code> is the heart of the JIT

178

dispatcher. Once a translated code address has been found, it is

179

executed simply by an x86 <code class="computeroutput">call</code>

180

to the translation. At the end of the translation, the next

181

original code addr is loaded into

182

<code class="computeroutput">%eax</code>, and the translation then

183

does a <code class="computeroutput">ret</code>, taking it back to

184

the dispatch loop, with, interestingly, zero branch

185

mispredictions. The address requested in

186

<code class="computeroutput">%eax</code> is looked up first in

187

<code class="function">VG_(tt_fast)</code>, and, if not found,

188

by calling C helper

189

<code class="function">VG_(search_transtab)</code>. If there

190

is still no translation available,

191

<code class="function">VG_(dispatch)</code> exits back to the

192

top-level C dispatcher

193

<code class="function">VG_(toploop)</code>, which arranges for

194

<code class="function">VG_(translate)</code> to make a new

195

translation. All fairly unsurprising, really. There are various

196

complexities described below.</p>

197

<p>The translator, orchestrated by

198

<code class="function">VG_(translate)</code>, is complicated

199

but entirely self-contained. It is described in great detail in

200

subsequent sections. Translations are stored in TC, with TT

201

tracking administrative information. The translations are

202

subject to an approximate LRU-based management scheme. With the

203

current settings, the TC can hold at most about 15MB of

204

translations, and LRU passes prune it to about 13.5MB. Given

205

that the orig-to-translation expansion ratio is about 13:1 to

206

14:1, this means TC holds translations for more or less a

207

megabyte of original code, which generally comes to about 70000

208

basic blocks for C++ compiled with optimisation on. Generating

209

new translations is expensive, so it is worth having a large TC

210

to minimise the (capacity) miss rate.</p>

211

<p>The dispatcher,

212

<code class="function">VG_(dispatch)</code>, receives hints

213

from the translations which allow it to cheaply spot all control

214

transfers corresponding to x86

215

<code class="computeroutput">call</code> and

216

<code class="computeroutput">ret</code> instructions. It has to do

217

this in order to spot some special events:</p>

218

219

<li><p>Calls to

220

<code class="function">VG_(shutdown)</code>. This is

221

Valgrind's cue to exit. NOTE: actually this is done a

222

different way; it should be cleaned up.</p></li>

223

<li><p>Returns of system call handlers, to the return address

224

<code class="function">VG_(signalreturn_bogusRA)</code>.

225

The signal simulator needs to know when a signal handler is

226

returning, so we spot jumps (returns) to this address.</p></li>

227

<li><p>Calls to <code class="function">vg_trap_here</code>.

228

All <code class="function">malloc</code>,

229

<code class="function">free</code>, etc calls that the

230

client program makes are eventually routed to a call to

231

<code class="function">vg_trap_here</code>, and Valgrind

232

does its own special thing with these calls. In effect this

233

provides a trapdoor, by which Valgrind can intercept certain

234

calls on the simulated CPU, run the call as it sees fit

235

itself (on the real CPU), and return the result to the

236

simulated CPU, quite transparently to the client

237

program.</p></li>

238

</ul></div>

239

<p>Valgrind intercepts the client's

240

<code class="function">malloc</code>,

241

<code class="function">free</code>, etc, calls, so that it can

242

store additional information. Each block

243

<code class="function">malloc</code>'d by the client gives

244

rise to a shadow block in which Valgrind stores the call stack at

245

the time of the <code class="function">malloc</code> call.

246

When the client calls <code class="function">free</code>,

247

Valgrind tries to find the shadow block corresponding to the

248

address passed to <code class="function">free</code>, and

249

emits an error message if none can be found. If it is found, the

250

block is placed on the freed blocks queue

251

<code class="computeroutput">vg_freed_list</code>, it is marked as

252

inaccessible, and its shadow block now records the call stack at

253

the time of the <code class="function">free</code> call.

254

Keeping <code class="computeroutput">free</code>'d blocks in this

255

queue allows Valgrind to spot all (presumably invalid) accesses

256

to them. However, once the volume of blocks in the free queue

257

exceeds <code class="function">VG_(clo_freelist_vol)</code>,

258

blocks are finally removed from the queue.</p>

259

<p>Keeping track of <code class="literal">A</code> and

260

<code class="literal">V</code> bits (note: if you don't know what these

261

are, you haven't read the user guide carefully enough) for memory

262

is done in <code class="filename">vg_memory.c</code>. This implements a

263

sparse array structure which covers the entire 4G address space

264

in a way which is reasonably fast and reasonably space efficient.

265

The 4G address space is divided up into 64K sections, each

266

covering 64Kb of address space. Given a 32-bit address, the top

267

16 bits are used to select one of the 65536 entries in

268

<code class="function">VG_(primary_map)</code>. The resulting

269

"secondary" (<code class="computeroutput">SecMap</code>) holds A and

270

V bits for the 64k of address space chunk corresponding to the

271

lower 16 bits of the address.</p>

272

</div>

273

274

275

<a name="mc-tech-docs.design"></a>1.1.3.�Design decisions</h3></div></div></div>

276

<p>Some design decisions were motivated by the need to make

277

Valgrind debuggable. Imagine you are writing a CPU simulator.

278

It works fairly well. However, you run some large program, like

279

Netscape, and after tens of millions of instructions, it crashes.

280

How can you figure out where in your simulator the bug is?</p>

281

<p>Valgrind's answer is: cheat. Valgrind is designed so that

282

it is possible to switch back to running the client program on

283

the real CPU at any point. Using the

284

<code class="option">--stop-after= </code> flag, you can ask

285

Valgrind to run just some number of basic blocks, and then run

286

the rest of the way on the real CPU. If you are searching for a

287

bug in the simulated CPU, you can use this to do a binary search,

288

which quickly leads you to the specific basic block which is

289

causing the problem.</p>

290

<p>This is all very handy. It does constrain the design in

291

certain unimportant ways. Firstly, the layout of memory, when

292

viewed from the client's point of view, must be identical

293

regardless of whether it is running on the real or simulated CPU.

294

This means that Valgrind can't do pointer swizzling -- well, no

295

great loss -- and it can't run on the same stack as the client --

296

again, no great loss. Valgrind operates on its own stack,

297

<code class="function">VG_(stack)</code>, which it switches to

298

at startup, temporarily switching back to the client's stack when

299

doing system calls for the client.</p>

300

<p>Valgrind also receives signals on its own stack,

301

<code class="computeroutput">VG_(sigstack)</code>, but for different

302

gruesome reasons discussed below.</p>

303

<p>This nice clean

304

switch-back-to-the-real-CPU-whenever-you-like story is muddied by

305

signals. Problem is that signals arrive at arbitrary times and

306

tend to slightly perturb the basic block count, with the result

307

that you can get close to the basic block causing a problem but

308

can't home in on it exactly. My kludgey hack is to define

309

<code class="computeroutput">SIGNAL_SIMULATION</code> to 1 towards

310

the bottom of <code class="filename">vg_syscall_mem.c</code>, so that

311

signal handlers are run on the real CPU and don't change the BB

312

counts.</p>

313

<p>A second hole in the switch-back-to-real-CPU story is that

314

Valgrind's way of delivering signals to the client is different

315

from that of the kernel. Specifically, the layout of the signal

316

delivery frame, and the mechanism used to detect a sighandler

317

returning, are different. So you can't expect to make the

318

transition inside a sighandler and still have things working, but

319

in practice that's not much of a restriction.</p>

320

<p>Valgrind's implementation of

321

<code class="function">malloc</code>,

322

<code class="function">free</code>, etc, (in

323

<code class="filename">vg_clientmalloc.c</code>, not the low-level stuff

324

in <code class="filename">vg_malloc2.c</code>) is somewhat complicated by

325

the need to handle switching back at arbitrary points. It does

326

work tho.</p>

327

</div>

328

329

330

<a name="mc-tech-docs.correctness"></a>1.1.4.�Correctness</h3></div></div></div>

331

<p>There's only one of me, and I have a Real Life (tm) as well

332

as hacking Valgrind [allegedly :-]. That means I don't have time

333

to waste chasing endless bugs in Valgrind. My emphasis is

334

therefore on doing everything as simply as possible, with

335

correctness, stability and robustness being the number one

336

priority, more important than performance or functionality. As a

337

result:</p>

338

339

<li>

340

<p>The code is absolutely loaded with assertions, and

341

these are <span><strong class="command">permanently enabled.</strong></span> I have no

342

plan to remove or disable them later. Over the past couple

343

of months, as valgrind has become more widely used, they have

344

shown their worth, pulling up various bugs which would

345

otherwise have appeared as hard-to-find segmentation

346

faults.</p>

347

<p>I am of the view that it's acceptable to spend 5% of

348

the total running time of your valgrindified program doing

349

assertion checks and other internal sanity checks.</p>

350

</li>

351

<li>

352

<p>Aside from the assertions, valgrind contains various

353

sets of internal sanity checks, which get run at varying

354

frequencies during normal operation.

355

<code class="function">VG_(do_sanity_checks)</code> runs

356

every 1000 basic blocks, which means 500 to 2000 times/second

357

for typical machines at present. It checks that Valgrind

358

hasn't overrun its private stack, and does some simple checks

359

on the memory permissions maps. Once every 25 calls it does

360

some more extensive checks on those maps. Etc, etc.</p>

361

<p>The following components also have sanity check code,

362

which can be enabled to aid debugging:</p>

363

364

<li><p>The low-level memory-manager

365

(<code class="computeroutput">VG_(mallocSanityCheckArena)</code>).

366

This does a complete check of all blocks and chains in an

367

arena, which is very slow. Is not engaged by default.</p></li>

368

<li><p>The symbol table reader(s): various checks to

369

ensure uniqueness of mappings; see

370

<code class="function">VG_(read_symbols)</code> for a

371

start. Is permanently engaged.</p></li>

372

<li><p>The A and V bit tracking stuff in

373

<code class="filename">vg_memory.c</code>. This can be compiled

374

with cpp symbol

375

<code class="computeroutput">VG_DEBUG_MEMORY</code> defined,

376

which removes all the fast, optimised cases, and uses

377

simple-but-slow fallbacks instead. Not engaged by

378

default.</p></li>

379

<li><p>Ditto

380

<code class="computeroutput">VG_DEBUG_LEAKCHECK</code>.</p></li>

381

<li><p>The JITter parses x86 basic blocks into sequences

382

of UCode instructions. It then sanity checks each one

383

with <code class="function">VG_(saneUInstr)</code> and

384

sanity checks the sequence as a whole with

385

<code class="function">VG_(saneUCodeBlock)</code>.

386

This stuff is engaged by default, and has caught some

387

way-obscure bugs in the simulated CPU machinery in its

388

time.</p></li>

389

<li><p>The system call wrapper does

390

<code class="function">VG_(first_and_last_secondaries_look_plausible)</code>

391

after every syscall; this is known to pick up bugs in the

392

syscall wrappers. Engaged by default.</p></li>

393

<li><p>The main dispatch loop, in

394

<code class="function">VG_(dispatch)</code>, checks

395

that translations do not set

396

<code class="computeroutput">%ebp</code> to any value

397

different from

398

<code class="computeroutput">VG_EBP_DISPATCH_CHECKED</code>

399

or <code class="computeroutput">& VG_(baseBlock)</code>.

400

In effect this test is free, and is permanently

401

engaged.</p></li>

402

<li><p>There are a couple of ifdefed-out consistency

403

checks I inserted whilst debugging the new register

404

allocater,

405

<code class="computeroutput">vg_do_register_allocation</code>.</p></li>

406

</ul></div>

407

</li>

408

<li><p>I try to avoid techniques, algorithms, mechanisms, etc,

409

for which I can supply neither a convincing argument that

410

they are correct, nor sanity-check code which might pick up

411

bugs in my implementation. I don't always succeed in this,

412

but I try. Basically the idea is: avoid techniques which

413

are, in practice, unverifiable, in some sense. When doing

414

anything, always have in mind: "how can I verify that this is

415

correct?"</p></li>

416

</ul></div>

417

<p>Some more specific things are:</p>

418

419

<li>

420

<p>Valgrind runs in the same namespace as the client, at

421

least from <code class="filename">ld.so</code>'s point of view, and it

422

therefore absolutely had better not export any symbol with a

423

name which could clash with that of the client or any of its

424

libraries. Therefore, all globally visible symbols exported

425

from <code class="filename">valgrind.so</code> are defined using the

426

<code class="computeroutput">VG_</code> CPP macro. As you'll

427

see from <code class="filename">vg_constants.h</code>, this appends

428

some arbitrary prefix to the symbol, in order that it be, we

429

hope, globally unique. Currently the prefix is

430

<code class="computeroutput">vgPlain_</code>. For convenience

431

there are also <code class="computeroutput">VGM_</code>,

432

<code class="computeroutput">VGP_</code> and

433

<code class="computeroutput">VGOFF_</code>. All locally defined

434

symbols are declared <code class="computeroutput">static</code>

435

and do not appear in the final shared object.</p>

436

<p>To check this, I periodically do <code class="computeroutput">nm

437

valgrind.so | grep " T "</code>, which shows you

438

all the globally exported text symbols. They should all have

439

an approved prefix, except for those like

440

<code class="function">malloc</code>,

441

<code class="function">free</code>, etc, which we

442

deliberately want to shadow and take precedence over the same

443

names exported from <code class="filename">glibc.so</code>, so that

444

valgrind can intercept those calls easily. Similarly,

445

<code class="computeroutput">nm valgrind.so | grep " D "</code>

446

allows you to find any rogue data-segment symbol

447

names.</p>

448

</li>

449

<li>

450

<p>Valgrind tries, and almost succeeds, in being

451

completely independent of all other shared objects, in

452

particular of <code class="filename">glibc.so</code>. For example, we

453

have our own low-level memory manager in

454

<code class="filename">vg_malloc2.c</code>, which is a fairly standard

455

malloc/free scheme augmented with arenas, and

456

<code class="filename">vg_mylibc.c</code> exports reimplementations of

457

various bits and pieces you'd normally get from the C

458

library.</p>

459

<p>Why all the hassle? Because imagine the potential

460

chaos of both the simulated and real CPUs executing in

461

<code class="filename">glibc.so</code>. It just seems simpler and

462

cleaner to be completely self-contained, so that only the

463

simulated CPU visits <code class="filename">glibc.so</code>. In

464

practice it's not much hassle anyway. Also, valgrind starts

465

up before glibc has a chance to initialise itself, and who

466

knows what difficulties that could lead to. Finally, glibc

467

has definitions for some types, specifically

468

<code class="computeroutput">sigset_t</code>, which conflict

469

(are different from) the Linux kernel's idea of same. When

470

Valgrind wants to fiddle around with signal stuff, it wants

471

to use the kernel's definitions, not glibc's definitions. So

472

it's simplest just to keep glibc out of the picture

473

entirely.</p>

474

<p>To find out which glibc symbols are used by Valgrind,

475

reinstate the link flags <code class="computeroutput">-nostdlib

476

-Wl,-no-undefined</code>. This causes linking to

477

fail, but will tell you what you depend on. I have mostly,

478

but not entirely, got rid of the glibc dependencies; what

479

remains is, IMO, fairly harmless. AFAIK the current

480

dependencies are: <code class="computeroutput">memset</code>,

481

<code class="computeroutput">memcmp</code>,

482

<code class="computeroutput">stat</code>,

483

<code class="computeroutput">system</code>,

484

<code class="computeroutput">sbrk</code>,

485

<code class="computeroutput">setjmp</code> and

486

<code class="computeroutput">longjmp</code>.</p>

487

</li>

488

<li>

489

<p>Similarly, valgrind should not really import any

490

headers other than the Linux kernel headers, since it knows

491

of no API other than the kernel interface to talk to. At the

492

moment this is really not in a good state, and

493

<code class="computeroutput">vg_syscall_mem</code> imports, via

494

<code class="filename">vg_unsafe.h</code>, a significant number of

495

C-library headers so as to know the sizes of various structs

496

passed across the kernel boundary. This is of course

497

completely bogus, since there is no guarantee that the C

498

library's definitions of these structs matches those of the

499

kernel. I have started to sort this out using

500

<code class="filename">vg_kerneliface.h</code>, into which I had

501

intended to copy all kernel definitions which valgrind could

502

need, but this has not gotten very far. At the moment it

503

mostly contains definitions for

504

<code class="computeroutput">sigset_t</code> and

505

<code class="computeroutput">struct sigaction</code>, since the

506

kernel's definition for these really does clash with glibc's.

507

I plan to use a <code class="computeroutput">vki_</code> prefix

508

on all these types and constants, to denote the fact that

509

they pertain to <span><strong class="command">V</strong></span>algrind's

510

<span><strong class="command">K</strong></span>ernel

511

<span><strong class="command">I</strong></span>nterface.</p>

512

<p>Another advantage of having a

513

<code class="filename">vg_kerneliface.h</code> file is that it makes

514

it simpler to interface to a different kernel. Once can, for

515

example, easily imagine writing a new

516

<code class="filename">vg_kerneliface.h</code> for FreeBSD, or x86

517

NetBSD.</p>

518

</li>

519

</ul></div>

520

</div>

521

522

523

<a name="mc-tech-docs.limits"></a>1.1.5.�Current limitations</h3></div></div></div>

524

<p>Support for weird (non-POSIX) signal stuff is patchy. Does

525

anybody care?</p>

526

</div>

527

</div>

528

529

530

<a name="mc-tech-docs.jitter"></a>1.2.�The instrumenting JITter</h2></div></div></div>

531

<p>This really is the heart of the matter. We begin with

532

various side issues.</p>

533

534

535

<a name="mc-tech-docs.storage"></a>1.2.1.�Run-time storage, and the use of host registers</h3></div></div></div>

536

<p>Valgrind translates client (original) basic blocks into

537

instrumented basic blocks, which live in the translation cache

538

TC, until either the client finishes or the translations are

539

ejected from TC to make room for newer ones.</p>

540

<p>Since it generates x86 code in memory, Valgrind has

541

complete control of the use of registers in the translations.

542

Now pay attention. I shall say this only once, and it is

543

important you understand this. In what follows I will refer to

544

registers in the host (real) cpu using their standard names,

545

<code class="computeroutput">%eax</code>,

546

<code class="computeroutput">%edi</code>, etc. I refer to registers

547

in the simulated CPU by capitalising them:

548

<code class="computeroutput">%EAX</code>,

549

<code class="computeroutput">%EDI</code>, etc. These two sets of

550

registers usually bear no direct relationship to each other;

551

there is no fixed mapping between them. This naming scheme is

552

used fairly consistently in the comments in the sources.</p>

553

<p>Host registers, once things are up and running, are used as

554

follows:</p>

555

556

<li><p><code class="computeroutput">%esp</code>, the real stack

557

pointer, points somewhere in Valgrind's private stack area,

558

<code class="computeroutput">VG_(stack)</code> or, transiently,

559

into its signal delivery stack,

560

<code class="computeroutput">VG_(sigstack)</code>.</p></li>

561

<li><p><code class="computeroutput">%edi</code> is used as a

562

temporary in code generation; it is almost always dead,

563

except when used for the

564

<code class="computeroutput">Left</code> value-tag operations.</p></li>

565

<li><p><code class="computeroutput">%eax</code>,

566

<code class="computeroutput">%ebx</code>,

567

<code class="computeroutput">%ecx</code>,

568

<code class="computeroutput">%edx</code> and

569

<code class="computeroutput">%esi</code> are available to

570

Valgrind's register allocator. They are dead (carry

571

unimportant values) in between translations, and are live

572

only in translations. The one exception to this is

573

<code class="computeroutput">%eax</code>, which, as mentioned

574

far above, has a special significance to the dispatch loop

575

<code class="computeroutput">VG_(dispatch)</code>: when a

576

translation returns to the dispatch loop,

577

<code class="computeroutput">%eax</code> is expected to contain

578

the original-code-address of the next translation to run.

579

The register allocator is so good at minimising spill code

580

that using five regs and not having to save/restore

581

<code class="computeroutput">%edi</code> actually gives better

582

code than allocating to <code class="computeroutput">%edi</code>

583

as well, but then having to push/pop it around special

584

uses.</p></li>

585

<li><p><code class="computeroutput">%ebp</code> points

586

permanently at

587

<code class="computeroutput">VG_(baseBlock)</code>. Valgrind's

588

translations are position-independent, partly because this is

589

convenient, but also because translations get moved around in

590

TC as part of the LRUing activity. <span><strong class="command">All</strong></span>

591

static entities which need to be referred to from generated

592

code, whether data or helper functions, are stored starting

593

at <code class="computeroutput">VG_(baseBlock)</code> and are

594

therefore reached by indexing from

595

<code class="computeroutput">%ebp</code>. There is but one

596

exception, which is that by placing the value

597

<code class="computeroutput">VG_EBP_DISPATCH_CHECKED</code> in

598

<code class="computeroutput">%ebp</code> just before a return to

599

the dispatcher, the dispatcher is informed that the next

600

address to run, in <code class="computeroutput">%eax</code>,

601

requires special treatment.</p></li>

602

<li><p>The real machine's FPU state is pretty much

603

unimportant, for reasons which will become obvious. Ditto

604

its <code class="computeroutput">%eflags</code> register.</p></li>

605

</ul></div>

606

<p>The state of the simulated CPU is stored in memory, in

607

<code class="computeroutput">VG_(baseBlock)</code>, which is a block

608

of 200 words IIRC. Recall that

609

<code class="computeroutput">%ebp</code> points permanently at the

610

start of this block. Function

611

<code class="computeroutput">vg_init_baseBlock</code> decides what

612

the offsets of various entities in

613

<code class="computeroutput">VG_(baseBlock)</code> are to be, and

614

allocates word offsets for them. The code generator then emits

615

<code class="computeroutput">%ebp</code> relative addresses to get

616

at those things. The sequence in which entities are allocated

617

has been carefully chosen so that the 32 most popular entities

618

come first, because this means 8-bit offsets can be used in the

619

generated code.</p>

620

<p>If I was clever, I could make

621

<code class="computeroutput">%ebp</code> point 32 words along

622

<code class="computeroutput">VG_(baseBlock)</code>, so that I'd have

623

another 32 words of short-form offsets available, but that's just

624

complicated, and it's not important -- the first 32 words take

625

99% (or whatever) of the traffic.</p>

626

<p>Currently, the sequence of stuff in

627

<code class="computeroutput">VG_(baseBlock)</code> is as

628

follows:</p>

629

630

<li><p>9 words, holding the simulated integer registers,

631

632

.. <code class="computeroutput">%EDI</code>, and the simulated

633

flags, <code class="computeroutput">%EFLAGS</code>.</p></li>

634

<li><p>Another 9 words, holding the V bit "shadows" for the

635

above 9 regs.</p></li>

636

<li><p>The <span><strong class="command">addresses</strong></span> of various helper

637

routines called from generated code:

638

<code class="computeroutput">VG_(helper_value_check4_fail)</code>,

639

<code class="computeroutput">VG_(helper_value_check0_fail)</code>,

640

which register V-check failures,

641

<code class="computeroutput">VG_(helperc_STOREV4)</code>,

642

<code class="computeroutput">VG_(helperc_STOREV1)</code>,

643

<code class="computeroutput">VG_(helperc_LOADV4)</code>,

644

<code class="computeroutput">VG_(helperc_LOADV1)</code>, which

645

do stores and loads of V bits to/from the sparse array which

646

keeps track of V bits in memory, and

647

<code class="computeroutput">VGM_(handle_esp_assignment)</code>,

648

which messes with memory addressibility resulting from

649

changes in <code class="computeroutput">%ESP</code>.</p></li>

650

<li><p>The simulated <code class="computeroutput">%EIP</code>.</p></li>

651

<li><p>24 spill words, for when the register allocator can't

652

make it work with 5 measly registers.</p></li>

653

<li><p>Addresses of helpers

654

<code class="computeroutput">VG_(helperc_STOREV2)</code>,

655

<code class="computeroutput">VG_(helperc_LOADV2)</code>. These

656

are here because 2-byte loads and stores are relatively rare,

657

so are placed above the magic 32-word offset boundary.</p></li>

658

<li><p>For similar reasons, addresses of helper functions

659

<code class="computeroutput">VGM_(fpu_write_check)</code> and

660

<code class="computeroutput">VGM_(fpu_read_check)</code>, which

661

handle the A/V maps testing and changes required by FPU

662

writes/reads.</p></li>

663

<li><p>Some other boring helper addresses:

664

<code class="computeroutput">VG_(helper_value_check2_fail)</code>

665

and

666

<code class="computeroutput">VG_(helper_value_check1_fail)</code>.

667

These are probably never emitted now, and should be

668

removed.</p></li>

669

<li><p>The entire state of the simulated FPU, which I believe

670

to be 108 bytes long.</p></li>

671

<li><p>Finally, the addresses of various other helper

672

functions in <code class="filename">vg_helpers.S</code>, which deal

673

with rare situations which are tedious or difficult to

674

generate code in-line for.</p></li>

675

</ul></div>

676

<p>As a general rule, the simulated machine's state lives

677

permanently in memory at

678

<code class="computeroutput">VG_(baseBlock)</code>. However, the

679

JITter does some optimisations which allow the simulated integer

680

registers to be cached in real registers over multiple simulated

681

instructions within the same basic block. These are always

682

flushed back into memory at the end of every basic block, so that

683

the in-memory state is up-to-date between basic blocks. (This

684

flushing is implied by the statement above that the real

685

machine's allocatable registers are dead in between simulated

686

blocks).</p>

687

</div>

688

689

690

<a name="mc-tech-docs.startup"></a>1.2.2.�Startup, shutdown, and system calls</h3></div></div></div>

691

<p>Getting into of Valgrind

692

(<code class="computeroutput">VG_(startup)</code>, called from

693

<code class="filename">valgrind.so</code>'s initialisation section),

694

really means copying the real CPU's state into

695

<code class="computeroutput">VG_(baseBlock)</code>, and then

696

installing our own stack pointer, etc, into the real CPU, and

697

then starting up the JITter. Exiting valgrind involves copying

698

the simulated state back to the real state.</p>

699

<p>Unfortunately, there's a complication at startup time.

700

Problem is that at the point where we need to take a snapshot of

701

the real CPU's state, the offsets in

702

<code class="computeroutput">VG_(baseBlock)</code> are not set up

703

yet, because to do so would involve disrupting the real machine's

704

state significantly. The way round this is to dump the real

705

machine's state into a temporary, static block of memory,

706

<code class="computeroutput">VG_(m_state_static)</code>. We can

707

then set up the <code class="computeroutput">VG_(baseBlock)</code>

708

offsets at our leisure, and copy into it from

709

<code class="computeroutput">VG_(m_state_static)</code> at some

710

convenient later time. This copying is done by

711

<code class="computeroutput">VG_(copy_m_state_static_to_baseBlock)</code>.</p>

712

<p>On exit, the inverse transformation is (rather

713

unnecessarily) used: stuff in

714

<code class="computeroutput">VG_(baseBlock)</code> is copied to

715

<code class="computeroutput">VG_(m_state_static)</code>, and the

716

assembly stub then copies from

717

<code class="computeroutput">VG_(m_state_static)</code> into the

718

real machine registers.</p>

719

<p>Doing system calls on behalf of the client

720

(<code class="filename">vg_syscall.S</code>) is something of a half-way

721

house. We have to make the world look sufficiently like that

722

which the client would normally have to make the syscall actually

723

work properly, but we can't afford to lose control. So the trick

724

is to copy all of the client's state, <span><strong class="command">except its program

725

counter</strong></span>, into the real CPU, do the system call, and

726

copy the state back out. Note that the client's state includes

727

its stack pointer register, so one effect of this partial

728

restoration is to cause the system call to be run on the client's

729

stack, as it should be.</p>

730

<p>As ever there are complications. We have to save some of

731

our own state somewhere when restoring the client's state into

732

the CPU, so that we can keep going sensibly afterwards. In fact

733

the only thing which is important is our own stack pointer, but

734

for paranoia reasons I save and restore our own FPU state as

735

well, even though that's probably pointless.</p>

736

<p>The complication on the above complication is, that for

737

horrible reasons to do with signals, we may have to handle a

738

second client system call whilst the client is blocked inside

739

some other system call (unbelievable!). That means there's two

740

sets of places to dump Valgrind's stack pointer and FPU state

741

across the syscall, and we decide which to use by consulting

742

<code class="computeroutput">VG_(syscall_depth)</code>, which is in

743

turn maintained by

744

<code class="computeroutput">VG_(wrap_syscall)</code>.</p>

745

</div>

746

747

748

<a name="mc-tech-docs.ucode"></a>1.2.3.�Introduction to UCode</h3></div></div></div>

749

<p>UCode lies at the heart of the x86-to-x86 JITter. The

750

basic premise is that dealing the the x86 instruction set head-on

751

is just too darn complicated, so we do the traditional

752

compiler-writer's trick and translate it into a simpler,

753

easier-to-deal-with form.</p>

754

<p>In normal operation, translation proceeds through six

755

stages, coordinated by

756

<code class="computeroutput">VG_(translate)</code>:</p>

757

758

<li><p>Parsing of an x86 basic block into a sequence of UCode

759

instructions (<code class="computeroutput">VG_(disBB)</code>).</p></li>

760

<li><p>UCode optimisation

761

(<code class="computeroutput">vg_improve</code>), with the aim

762

of caching simulated registers in real registers over

763

multiple simulated instructions, and removing redundant

764

simulated <code class="computeroutput">%EFLAGS</code>

765

saving/restoring.</p></li>

766

<li><p>UCode instrumentation

767

(<code class="computeroutput">vg_instrument</code>), which adds

768

value and address checking code.</p></li>

769

<li><p>Post-instrumentation cleanup

770

(<code class="computeroutput">vg_cleanup</code>), removing

771

redundant value-check computations.</p></li>

772

<li><p>Register allocation

773

(<code class="computeroutput">vg_do_register_allocation</code>),

774

which, note, is done on UCode.</p></li>

775

<li><p>Emission of final instrumented x86 code

776

(<code class="computeroutput">VG_(emit_code)</code>).</p></li>

777

</ol></div>

778

<p>Notice how steps 2, 3, 4 and 5 are simple UCode-to-UCode

779

transformation passes, all on straight-line blocks of UCode (type

780

<code class="computeroutput">UCodeBlock</code>). Steps 2 and 4 are

781

optimisation passes and can be disabled for debugging purposes,

782

with <code class="option">--optimise=no</code> and

783

<code class="option">--cleanup=no</code> respectively.</p>

784

<p>Valgrind can also run in a no-instrumentation mode, given

785

<code class="option">--instrument=no</code>. This is useful

786

for debugging the JITter quickly without having to deal with the

787

complexity of the instrumentation mechanism too. In this mode,

788

steps 3 and 4 are omitted.</p>

789

<p>These flags combine, so that

790

<code class="option">--instrument=no</code> together with

791

<code class="option">--optimise=no</code> means only steps

792

1, 5 and 6 are used.

793

<code class="option">--single-step=yes</code> causes each

794

x86 instruction to be treated as a single basic block. The

795

translations are terrible but this is sometimes instructive.</p>

796

<p>The <code class="option">--stop-after=N</code> flag

797

switches back to the real CPU after

798

<code class="computeroutput">N</code> basic blocks. It also re-JITs

799

the final basic block executed and prints the debugging info

800

resulting, so this gives you a way to get a quick snapshot of how

801

a basic block looks as it passes through the six stages mentioned

802

above. If you want to see full information for every block

803

translated (probably not, but still ...) find, in

804

<code class="computeroutput">VG_(translate)</code>, the lines</p>

805

806

dis = True;

807

dis = debugging_translation;</pre>

808

<p>and comment out the second line. This will spew out

809

debugging junk faster than you can possibly imagine.</p>

810

</div>

811

812

813

<a name="mc-tech-docs.tags"></a>1.2.4.�UCode operand tags: type <code class="computeroutput">Tag</code></h3></div></div></div>

814

<p>UCode is, more or less, a simple two-address RISC-like

815

code. In keeping with the x86 AT&T assembly syntax,

816

generally speaking the first operand is the source operand, and

817

the second is the destination operand, which is modified when the

818

uinstr is notionally executed.</p>

819

<p>UCode instructions have up to three operand fields, each of

820

which has a corresponding <code class="computeroutput">Tag</code>

821

describing it. Possible values for the tag are:</p>

822

823

<li><p><code class="computeroutput">NoValue</code>: indicates

824

that the field is not in use.</p></li>

825

<li><p><code class="computeroutput">Lit16</code>: the field

826

contains a 16-bit literal.</p></li>

827

<li><p><code class="computeroutput">Literal</code>: the field

828

denotes a 32-bit literal, whose value is stored in the

829

<code class="computeroutput">lit32</code> field of the uinstr

830

itself. Since there is only one

831

<code class="computeroutput">lit32</code> for the whole uinstr,

832

only one operand field may contain this tag.</p></li>

833

<li><p><code class="computeroutput">SpillNo</code>: the field

834

contains a spill slot number, in the range 0 to 23 inclusive,

835

denoting one of the spill slots contained inside

836

<code class="computeroutput">VG_(baseBlock)</code>. Such tags

837

only exist after register allocation.</p></li>

838

<li><p><code class="computeroutput">RealReg</code>: the field

839

contains a number in the range 0 to 7 denoting an integer x86

840

("real") register on the host. The number is the Intel

841

encoding for integer registers. Such tags only exist after

842

843

<li><p><code class="computeroutput">ArchReg</code>: the field

844

contains a number in the range 0 to 7 denoting an integer x86

845

846

reference to one of the first 8 words of

847

<code class="computeroutput">VG_(baseBlock)</code>. Such tags

848

can exist at any point in the translation process.</p></li>

849

<li>

850

<p>Last, but not least,

851

<code class="computeroutput">TempReg</code>. The field contains

852

the number of one of an infinite set of virtual (integer)

853

registers. <code class="computeroutput">TempReg</code>s are used

854

everywhere throughout the translation process; you can have

855

as many as you want. The register allocator maps as many as

856

it can into <code class="computeroutput">RealReg</code>s and

857

turns the rest into

858

<code class="computeroutput">SpillNo</code>s, so

859

<code class="computeroutput">TempReg</code>s should not exist

860

after the register allocation phase.</p>

861

<p><code class="computeroutput">TempReg</code>s are always 32

862

bits long, even if the data they hold is logically shorter.

863

In that case the upper unused bits are required, and, I

864

think, generally assumed, to be zero.

865

<code class="computeroutput">TempReg</code>s holding V bits for

866

quantities shorter than 32 bits are expected to have ones in

867

the unused places, since a one denotes "undefined".</p>

868

</li>

869

</ul></div>

870

</div>

871

872

873

<a name="mc-tech-docs.uinstr"></a>1.2.5.�UCode instructions: type <code class="computeroutput">UInstr</code></h3></div></div></div>

874

<p>UCode was carefully designed to make it possible to do

875

876

x86 code without needing any extra registers ... well, that was

877

the original plan, anyway. Things have gotten a little more

878

complicated since then. In what follows, UCode instructions are

879

referred to as uinstrs, to distinguish them from x86

880

instructions. Uinstrs of course have uopcodes which are

881

(naturally) different from x86 opcodes.</p>

882

<p>A uinstr (type <code class="computeroutput">UInstr</code>)

883

contains various fields, not all of which are used by any one

884

uopcode:</p>

885

886

<li><p>Three 16-bit operand fields,

887

<code class="computeroutput">val1</code>,

888

<code class="computeroutput">val2</code> and

889

890

<li><p>Three tag fields,

891

<code class="computeroutput">tag1</code>,

892

<code class="computeroutput">tag2</code> and

893

<code class="computeroutput">tag3</code>. Each of these has a

894

value of type <code class="computeroutput">Tag</code>, and they

895

describe what the <code class="computeroutput">val1</code>,

896

<code class="computeroutput">val2</code> and

897

<code class="computeroutput">val3</code> fields contain.</p></li>

898

<li><p>A 32-bit literal field.</p></li>

899

<li><p>Two <code class="computeroutput">FlagSet</code>s,

900

specifying which x86 condition codes are read and written by

901

the uinstr.</p></li>

902

<li><p>An opcode byte, containing a value of type

903

<code class="computeroutput">Opcode</code>.</p></li>

904

<li><p>A size field, indicating the data transfer size

905

(1/2/4/8/10) in cases where this makes sense, or zero

906

otherwise.</p></li>

907

<li><p>A condition-code field, which, for jumps, holds a value

908

of type <code class="computeroutput">Condcode</code>, indicating

909

the condition which applies. The encoding is as it is in the

910

x86 insn stream, except we add a 17th value

911

<code class="computeroutput">CondAlways</code> to indicate an

912

unconditional transfer.</p></li>

913

<li><p>Various 1-bit flags, indicating whether this insn

914

pertains to an x86 CALL or RET instruction, whether a

915

widening is signed or not, etc.</p></li>

916

</ul></div>

917

<p>UOpcodes (type <code class="computeroutput">Opcode</code>) are

918

divided into two groups: those necessary merely to express the

919

functionality of the x86 code, and extra uopcodes needed to

920

express the instrumentation. The former group contains:</p>

921

922

<li><p><code class="computeroutput">GET</code> and

923

<code class="computeroutput">PUT</code>, which move values from

924

the simulated CPU's integer registers

925

(<code class="computeroutput">ArchReg</code>s) into

926

<code class="computeroutput">TempReg</code>s, and back.

927

<code class="computeroutput">GETF</code> and

928

<code class="computeroutput">PUTF</code> do the corresponding

929

thing for the simulated

930

<code class="computeroutput">%EFLAGS</code>. There are no

931

corresponding insns for the FPU register stack, since we

932

don't explicitly simulate its registers.</p></li>

933

<li><p><code class="computeroutput">LOAD</code> and

934

<code class="computeroutput">STORE</code>, which, in RISC-like

935

fashion, are the only uinstrs able to interact with

936

memory.</p></li>

937

<li><p><code class="computeroutput">MOV</code> and

938

<code class="computeroutput">CMOV</code> allow unconditional and

939

conditional moves of values between

940

<code class="computeroutput">TempReg</code>s.</p></li>

941

<li>

942

<p>ALU operations. Again in RISC-like fashion, these only

943

operate on <code class="computeroutput">TempReg</code>s (before

944

reg-alloc) or <code class="computeroutput">RealReg</code>s

945

(after reg-alloc). These are:

946

<code class="computeroutput">ADD</code>,

947

<code class="computeroutput">ADC</code>,

948

<code class="computeroutput">AND</code>,

949

<code class="computeroutput">OR</code>,

950

<code class="computeroutput">XOR</code>,

951

<code class="computeroutput">SUB</code>,

952

<code class="computeroutput">SBB</code>,

953

<code class="computeroutput">SHL</code>,

954

<code class="computeroutput">SHR</code>,

955

<code class="computeroutput">SAR</code>,

956

<code class="computeroutput">ROL</code>,

957

<code class="computeroutput">ROR</code>,

958

<code class="computeroutput">RCL</code>,

959

<code class="computeroutput">RCR</code>,

960

<code class="computeroutput">NOT</code>,

961

<code class="computeroutput">NEG</code>,

962

<code class="computeroutput">INC</code>,

963

<code class="computeroutput">DEC</code>,

964

<code class="computeroutput">BSWAP</code>,

965

<code class="computeroutput">CC2VAL</code> and

966

<code class="computeroutput">WIDEN</code>.

967

<code class="computeroutput">WIDEN</code> does signed or

968

unsigned value widening.

969

<code class="computeroutput">CC2VAL</code> is used to convert

970

condition codes into a value, zero or one. The rest are

971

obvious.</p>

972

<p>To allow for more efficient code generation, we bend

973

slightly the restriction at the start of the previous para:

974

for <code class="computeroutput">ADD</code>,

975

<code class="computeroutput">ADC</code>,

976

<code class="computeroutput">XOR</code>,

977

<code class="computeroutput">SUB</code> and

978

<code class="computeroutput">SBB</code>, we allow the first

979

(source) operand to also be an

980

<code class="computeroutput">ArchReg</code>, that is, one of the

981

simulated machine's registers. Also, many of these ALU ops

982

allow the source operand to be a literal. See

983

<code class="computeroutput">VG_(saneUInstr)</code> for the

984

final word on the allowable forms of uinstrs.</p>

985

</li>

986

<li><p><code class="computeroutput">LEA1</code> and

987

<code class="computeroutput">LEA2</code> are not strictly

988

necessary, but allow faciliate better translations. They

989

record the fancy x86 addressing modes in a direct way, which

990

allows those amodes to be emitted back into the final

991

instruction stream more or less verbatim.</p></li>

992

<li>

993

<p><code class="computeroutput">CALLM</code> calls a

994

machine-code helper, one of the methods whose address is

995

stored at some

996

<code class="computeroutput">VG_(baseBlock)</code> offset.

997

<code class="computeroutput">PUSH</code> and

998

<code class="computeroutput">POP</code> move values to/from

999

<code class="computeroutput">TempReg</code> to the real

1000

(Valgrind's) stack, and

1001

<code class="computeroutput">CLEAR</code> removes values from

1002

the stack. <code class="computeroutput">CALLM_S</code> and

1003

<code class="computeroutput">CALLM_E</code> delimit the

1004

boundaries of call setups and clearings, for the benefit of

1005

the instrumentation passes. Getting this right is critical,

1006

and so <code class="computeroutput">VG_(saneUCodeBlock)</code>

1007

makes various checks on the use of these uopcodes.</p>

1008

<p>It is important to understand that these uopcodes have

1009

nothing to do with the x86

1010

<code class="computeroutput">call</code>,

1011

<code class="computeroutput">return,</code>

1012

<code class="computeroutput">push</code> or

1013

<code class="computeroutput">pop</code> instructions, and are

1014

not used to implement them. Those guys turn into

1015

combinations of <code class="computeroutput">GET</code>,

1016

<code class="computeroutput">PUT</code>,

1017

<code class="computeroutput">LOAD</code>,

1018

<code class="computeroutput">STORE</code>,

1019

<code class="computeroutput">ADD</code>,

1020

<code class="computeroutput">SUB</code>, and

1021

<code class="computeroutput">JMP</code>. What these uopcodes

1022

support is calling of helper functions such as

1023

<code class="computeroutput">VG_(helper_imul_32_64)</code>,

1024

which do stuff which is too difficult or tedious to emit

1025

inline.</p>

1026

</li>

1027

<li><p><code class="computeroutput">FPU</code>,

1028

<code class="computeroutput">FPU_R</code> and

1029

<code class="computeroutput">FPU_W</code>. Valgrind doesn't

1030

attempt to simulate the internal state of the FPU at all.

1031

Consequently it only needs to be able to distinguish FPU ops

1032

which read and write memory from those that don't, and for

1033

those which do, it needs to know the effective address and

1034

data transfer size. This is made easier because the x86 FP

1035

instruction encoding is very regular, basically consisting of

1036

16 bits for a non-memory FPU insn and 11 (IIRC) bits + an

1037

address mode for a memory FPU insn. So our

1038

<code class="computeroutput">FPU</code> uinstr carries the 16

1039

bits in its <code class="computeroutput">val1</code> field. And

1040

<code class="computeroutput">FPU_R</code> and

1041

<code class="computeroutput">FPU_W</code> carry 11 bits in that

1042

field, together with the identity of a

1043

<code class="computeroutput">TempReg</code> or (later)

1044

<code class="computeroutput">RealReg</code> which contains the

1045

address.</p></li>

1046

<li><p><code class="computeroutput">JIFZ</code> is unique, in

1047

that it allows a control-flow transfer which is not deemed to

1048

end a basic block. It causes a jump to a literal (original)

1049

address if the specified argument is zero.</p></li>

1050

<li><p>Finally, <code class="computeroutput">INCEIP</code>

1051

advances the simulated <code class="computeroutput">%EIP</code>

1052

by the specified literal amount. This supports lazy

1053

<code class="computeroutput">%EIP</code> updating, as described

1054

below.</p></li>

1055

</ul></div>

1056

<p>Stages 1 and 2 of the 6-stage translation process mentioned

1057

above deal purely with these uopcodes, and no others. They are

1058

sufficient to express pretty much all the x86 32-bit

1059

protected-mode instruction set, at least everything understood by

1060

a pre-MMX original Pentium (P54C).</p>

1061

<p>Stages 3, 4, 5 and 6 also deal with the following extra

1062

"instrumentation" uopcodes. They are used to express all the

1063

definedness-tracking and -checking machinery which valgrind does.

1064

In later sections we show how to create checking code for each of

1065

the uopcodes above. Note that these instrumentation uopcodes,

1066

although some appearing complicated, have been carefully chosen

1067

so that efficient x86 code can be generated for them. GNU

1068

superopt v2.5 did a great job helping out here. Anyways, the

1069

uopcodes are as follows:</p>

1070

1071

<li><p><code class="computeroutput">GETV</code> and

1072

<code class="computeroutput">PUTV</code> are analogues to

1073

<code class="computeroutput">GET</code> and

1074

<code class="computeroutput">PUT</code> above. They are

1075

identical except that they move the V bits for the specified

1076

values back and forth to

1077

<code class="computeroutput">TempRegs</code>, rather than moving

1078

the values themselves.</p></li>

1079

<li><p>Similarly, <code class="computeroutput">LOADV</code> and

1080

<code class="computeroutput">STOREV</code> read and write V bits

1081

from the synthesised shadow memory that Valgrind maintains.

1082

In fact they do more than that, since they also do

1083

address-validity checks, and emit complaints if the

1084

read/written addresses are unaddressible.</p></li>

1085

<li><p><code class="computeroutput">TESTV</code>, whose

1086

parameters are a <code class="computeroutput">TempReg</code> and

1087

a size, tests the V bits in the

1088

<code class="computeroutput">TempReg</code>, at the specified

1089

operation size (0/1/2/4 byte) and emits an error if any of

1090

them indicate undefinedness. This is the only uopcode

1091

capable of doing such tests.</p></li>

1092

<li><p><code class="computeroutput">SETV</code>, whose parameters

1093

are also <code class="computeroutput">TempReg</code> and a size,

1094

makes the V bits in the

1095

<code class="computeroutput">TempReg</code> indicated

1096

definedness, at the specified operation size. This is

1097

usually used to generate the correct V bits for a literal

1098

value, which is of course fully defined.</p></li>

1099

<li><p><code class="computeroutput">GETVF</code> and

1100

<code class="computeroutput">PUTVF</code> are analogues to

1101

<code class="computeroutput">GETF</code> and

1102

<code class="computeroutput">PUTF</code>. They move the single

1103

V bit used to model definedness of

1104

<code class="computeroutput">%EFLAGS</code> between its home in

1105

<code class="computeroutput">VG_(baseBlock)</code> and the

1106

specified <code class="computeroutput">TempReg</code>.</p></li>

1107

<li><p><code class="computeroutput">TAG1</code> denotes one of a

1108

family of unary operations on

1109

<code class="computeroutput">TempReg</code>s containing V bits.

1110

Similarly, <code class="computeroutput">TAG2</code> denotes one

1111

in a family of binary operations on V bits.</p></li>

1112

</ul></div>

1113

<p>These 10 uopcodes are sufficient to express Valgrind's

1114

entire definedness-checking semantics. In fact most of the

1115

interesting magic is done by the

1116

<code class="computeroutput">TAG1</code> and

1117

<code class="computeroutput">TAG2</code> suboperations.</p>

1118

<p>First, however, I need to explain about V-vector operation

1119

sizes. There are 4 sizes: 1, 2 and 4, which operate on groups of

1120

8, 16 and 32 V bits at a time, supporting the usual 1, 2 and 4

1121

byte x86 operations. However there is also the mysterious size

1122

0, which really means a single V bit. Single V bits are used in

1123

various circumstances; in particular, the definedness of

1124

<code class="computeroutput">%EFLAGS</code> is modelled with a

1125

single V bit. Now might be a good time to also point out that

1126

for V bits, 1 means "undefined" and 0 means "defined".

1127

Similarly, for A bits, 1 means "invalid address" and 0 means

1128

"valid address". This seems counterintuitive (and so it is), but

1129

testing against zero on x86s saves instructions compared to

1130

testing against all 1s, because many ALU operations set the Z

1131

flag for free, so to speak.</p>

1132

1133

1134

<li>

1135

<p><b>(UNARY) Pessimising casts:�</b><code class="computeroutput">VgT_PCast40</code>,

1136

<code class="computeroutput">VgT_PCast20</code>,

1137

<code class="computeroutput">VgT_PCast10</code>,

1138

<code class="computeroutput">VgT_PCast01</code>,

1139

<code class="computeroutput">VgT_PCast02</code> and

1140

<code class="computeroutput">VgT_PCast04</code>. A "pessimising

1141

cast" takes a V-bit vector at one size, and creates a new one

1142

at another size, pessimised in the sense that if any of the

1143

bits in the source vector indicate undefinedness, then all

1144

the bits in the result indicate undefinedness. In this case

1145

the casts are all to or from a single V bit, so for example

1146

<code class="computeroutput">VgT_PCast40</code> is a pessimising

1147

cast from 32 bits to 1, whereas

1148

<code class="computeroutput">VgT_PCast04</code> simply copies

1149

the single source V bit into all 32 bit positions in the

1150

result. Surprisingly, these ops can all be implemented very

1151

efficiently.</p>

1152

<p>There are also the pessimising casts

1153

<code class="computeroutput">VgT_PCast14</code>, from 8 bits to

1154

32, <code class="computeroutput">VgT_PCast12</code>, from 8 bits

1155

to 16, and <code class="computeroutput">VgT_PCast11</code>, from

1156

8 bits to 8. This last one seems nonsensical, but in fact it

1157

isn't a no-op because, as mentioned above, any undefined (1)

1158

bits in the source infect the entire result.</p>

1159

</li>

1160

<li><p><b>(UNARY) Propagating undefinedness upwards in a

1161

word:�</b><code class="computeroutput">VgT_Left4</code>,

1162

<code class="computeroutput">VgT_Left2</code> and

1163

<code class="computeroutput">VgT_Left1</code>. These are used

1164

to simulate the worst-case effects of carry propagation in

1165

adds and subtracts. They return a V vector identical to the

1166

original, except that if the original contained any undefined

1167

bits, then it and all bits above it are marked as undefined

1168

too. Hence the Left bit in the names.</p></li>

1169

<li><p><b>(UNARY) Signed and unsigned value widening:�</b><code class="computeroutput">VgT_SWiden14</code>,

1170

<code class="computeroutput">VgT_SWiden24</code>,

1171

<code class="computeroutput">VgT_SWiden12</code>,

1172

<code class="computeroutput">VgT_ZWiden14</code>,

1173

<code class="computeroutput">VgT_ZWiden24</code> and

1174

<code class="computeroutput">VgT_ZWiden12</code>. These mimic

1175

the definedness effects of standard signed and unsigned

1176

integer widening. Unsigned widening creates zero bits in the

1177

new positions, so

1178

<code class="computeroutput">VgT_ZWiden*</code> accordingly park

1179

mark those parts of their argument as defined. Signed

1180

widening copies the sign bit into the new positions, so

1181

<code class="computeroutput">VgT_SWiden*</code> copies the

1182

definedness of the sign bit into the new positions. Because

1183

1 means undefined and 0 means defined, these operations can

1184

(fascinatingly) be done by the same operations which they

1185

mimic. Go figure.</p></li>

1186

<li><p><b>(BINARY) Undefined-if-either-Undefined,

1187

Defined-if-either-Defined:�</b><code class="computeroutput">VgT_UifU4</code>,

1188

<code class="computeroutput">VgT_UifU2</code>,

1189

<code class="computeroutput">VgT_UifU1</code>,

1190

<code class="computeroutput">VgT_UifU0</code>,

1191

<code class="computeroutput">VgT_DifD4</code>,

1192

<code class="computeroutput">VgT_DifD2</code>,

1193

<code class="computeroutput">VgT_DifD1</code>. These do simple

1194

bitwise operations on pairs of V-bit vectors, with

1195

<code class="computeroutput">UifU</code> giving undefined if

1196

either arg bit is undefined, and

1197

<code class="computeroutput">DifD</code> giving defined if

1198

either arg bit is defined. Abstract interpretation junkies,

1199

if any make it this far, may like to think of them as meets

1200

and joins (or is it joins and meets) in the definedness

1201

lattices.</p></li>

1202

<li>

1203

<p><b>(BINARY; one value, one V bits) Generate argument

1204

improvement terms for AND and OR.�</b><code class="computeroutput">VgT_ImproveAND4_TQ</code>,

1205

<code class="computeroutput">VgT_ImproveAND2_TQ</code>,

1206

<code class="computeroutput">VgT_ImproveAND1_TQ</code>,

1207

<code class="computeroutput">VgT_ImproveOR4_TQ</code>,

1208

<code class="computeroutput">VgT_ImproveOR2_TQ</code>,

1209

<code class="computeroutput">VgT_ImproveOR1_TQ</code>. These

1210

help out with AND and OR operations. AND and OR have the

1211

inconvenient property that the definedness of the result

1212

depends on the actual values of the arguments as well as

1213

their definedness. At the bit level:</p>

1214

1215

1 AND undefined = undefined, but

1216

0 AND undefined = 0, and

1217

similarly

1218

0 OR undefined = undefined, but

1219

1 OR undefined = 1.</pre>

1220

<p>It turns out that gcc (quite legitimately) generates

1221

code which relies on this fact, so we have to model it

1222

properly in order to avoid flooding users with spurious value

1223

errors. The ultimate definedness result of AND and OR is

1224

calculated using <code class="computeroutput">UifU</code> on the

1225

definedness of the arguments, but we also

1226

<code class="computeroutput">DifD</code> in some "improvement"

1227

terms which take into account the above phenomena.</p>

1228

<p><code class="computeroutput">ImproveAND</code> takes as

1229

its first argument the actual value of an argument to AND

1230

(the T) and the definedness of that argument (the Q), and

1231

returns a V-bit vector which is defined (0) for bits which

1232

have value 0 and are defined; this, when

1233

<code class="computeroutput">DifD</code> into the final result

1234

causes those bits to be defined even if the corresponding bit

1235

in the other argument is undefined.</p>

1236

<p>The <code class="computeroutput">ImproveOR</code> ops do

1237

the dual thing for OR arguments. Note that XOR does not have

1238

this property that one argument can make the other

1239

irrelevant, so there is no need for such complexity for

1240

XOR.</p>

1241

</li>

1242

</ul></div>

1243

<p>That's all the tag ops. If you stare at this long enough,

1244

and then run Valgrind and stare at the pre- and post-instrumented

1245

ucode, it should be fairly obvious how the instrumentation

1246

machinery hangs together.</p>

1247

<p>One point, if you do this: in order to make it easy to

1248

differentiate <code class="computeroutput">TempReg</code>s carrying

1249

values from <code class="computeroutput">TempReg</code>s carrying V

1250

bit vectors, Valgrind prints the former as (for example)

1251

<code class="computeroutput">t28</code> and the latter as

1252

<code class="computeroutput">q28</code>; the fact that they carry

1253

the same number serves to indicate their relationship. This is

1254

purely for the convenience of the human reader; the register

1255

allocator and code generator don't regard them as

1256

different.</p>

1257

</div>

1258

1259

1260

<a name="mc-tech-docs.trans"></a>1.2.6.�Translation into UCode</h3></div></div></div>

1261

<p><code class="computeroutput">VG_(disBB)</code> allocates a new

1262

<code class="computeroutput">UCodeBlock</code> and then uses

1263

<code class="computeroutput">disInstr</code> to translate x86

1264

instructions one at a time into UCode, dumping the result in the

1265

<code class="computeroutput">UCodeBlock</code>. This goes on until

1266

a control-flow transfer instruction is encountered.</p>

1267

<p>Despite the large size of

1268

<code class="filename">vg_to_ucode.c</code>, this translation is really

1269

very simple. Each x86 instruction is translated entirely

1270

independently of its neighbours, merrily allocating new

1271

<code class="computeroutput">TempReg</code>s as it goes. The idea

1272

is to have a simple translator -- in reality, no more than a

1273

macro-expander -- and the -- resulting bad UCode translation is

1274

cleaned up by the UCode optimisation phase which follows. To

1275

give you an idea of some x86 instructions and their translations

1276

(this is a complete basic block, as Valgrind sees it):</p>

1277

1278

0x40435A50: incl %edx

1279

0: GETL %EDX, t0

1280

1: INCL t0 (-wOSZAP)

1281

2: PUTL t0, %EDX

1282

1283

0x40435A51: movsbl (%edx),%eax

1284

3: GETL %EDX, t2

1285

4: LDB (t2), t2

1286

5: WIDENL_Bs t2

1287

6: PUTL t2, %EAX

1288

1289

0x40435A54: testb $0x20, 1(%ecx,%eax,2)

1290

7: GETL %EAX, t6

1291

8: GETL %ECX, t8

1292

9: LEA2L 1(t8,t6,2), t4

1293

10: LDB (t4), t10

1294

11: MOVB $0x20, t12

1295

12: ANDB t12, t10 (-wOSZACP)

1296

13: INCEIPo $9

1297

1298

0x40435A59: jnz-8 0x40435A50

1299

14: Jnzo $0x40435A50 (-rOSZACP)

1300

15: JMPo $0x40435A5B</pre>

1301

<p>Notice how the block always ends with an unconditional jump

1302

to the next block. This is a bit unnecessary, but makes many

1303

things simpler.</p>

1304

<p>Most x86 instructions turn into sequences of

1305

<code class="computeroutput">GET</code>,

1306

<code class="computeroutput">PUT</code>,

1307

<code class="computeroutput">LEA1</code>,

1308

<code class="computeroutput">LEA2</code>,

1309

<code class="computeroutput">LOAD</code> and

1310

<code class="computeroutput">STORE</code>. Some complicated ones

1311

however rely on calling helper bits of code in

1312

<code class="filename">vg_helpers.S</code>. The ucode instructions

1313

<code class="computeroutput">PUSH</code>,

1314

<code class="computeroutput">POP</code>,

1315

<code class="computeroutput">CALL</code>,

1316

<code class="computeroutput">CALLM_S</code> and

1317

<code class="computeroutput">CALLM_E</code> support this. The

1318

calling convention is somewhat ad-hoc and is not the C calling

1319

convention. The helper routines must save all integer registers,

1320

and the flags, that they use. Args are passed on the stack

1321

underneath the return address, as usual, and if result(s) are to

1322

be returned, it (they) are either placed in dummy arg slots

1323

created by the ucode <code class="computeroutput">PUSH</code>

1324

sequence, or just overwrite the incoming args.</p>

1325

<p>In order that the instrumentation mechanism can handle

1326

calls to these helpers,

1327

<code class="computeroutput">VG_(saneUCodeBlock)</code> enforces the

1328

following restrictions on calls to helpers:</p>

1329

1330

<li><p>Each <code class="computeroutput">CALL</code> uinstr must

1331

be bracketed by a preceding

1332

<code class="computeroutput">CALLM_S</code> marker (dummy

1333

uinstr) and a trailing

1334

<code class="computeroutput">CALLM_E</code> marker. These

1335

markers are used by the instrumentation mechanism later to

1336

establish the boundaries of the

1337

<code class="computeroutput">PUSH</code>,

1338

<code class="computeroutput">POP</code> and

1339

<code class="computeroutput">CLEAR</code> sequences for the

1340

call.</p></li>

1341

<li><p><code class="computeroutput">PUSH</code>,

1342

<code class="computeroutput">POP</code> and

1343

<code class="computeroutput">CLEAR</code> may only appear inside

1344

sections bracketed by

1345

<code class="computeroutput">CALLM_S</code> and

1346

<code class="computeroutput">CALLM_E</code>, and nowhere else.</p></li>

1347

<li><p>In any such bracketed section, no two

1348

<code class="computeroutput">PUSH</code> insns may push the same

1349

<code class="computeroutput">TempReg</code>. Dually, no two two

1350

<code class="computeroutput">POP</code>s may pop the same

1351

<code class="computeroutput">TempReg</code>.</p></li>

1352

<li><p>Finally, although this is not checked, args should be

1353

removed from the stack with

1354

<code class="computeroutput">CLEAR</code>, rather than

1355

<code class="computeroutput">POP</code>s into a

1356

<code class="computeroutput">TempReg</code> which is not

1357

subsequently used. This is because the instrumentation

1358

mechanism assumes that all values

1359

<code class="computeroutput">POP</code>ped from the stack are

1360

actually used.</p></li>

1361

</ul></div>

1362

<p>Some of the translations may appear to have redundant

1363

<code class="computeroutput">TempReg</code>-to-<code class="computeroutput">TempReg</code>

1364

moves. This helps the next phase, UCode optimisation, to

1365

generate better code.</p>

1366

</div>

1367

1368

1369

<a name="mc-tech-docs.optim"></a>1.2.7.�UCode optimisation</h3></div></div></div>

1370

<p>UCode is then subjected to an improvement pass

1371

(<code class="computeroutput">vg_improve()</code>), which blurs the

1372

boundaries between the translations of the original x86

1373

instructions. It's pretty straightforward. Three

1374

transformations are done:</p>

1375

1376

<li>

1377

<p>Redundant <code class="computeroutput">GET</code>

1378

elimination. Actually, more general than that -- eliminates

1379

redundant fetches of ArchRegs. In our running example,

1380

uinstr 3 <code class="computeroutput">GET</code>s

1381

<code class="computeroutput">%EDX</code> into

1382

<code class="computeroutput">t2</code> despite the fact that, by

1383

looking at the previous uinstr, it is already in

1384

<code class="computeroutput">t0</code>. The

1385

<code class="computeroutput">GET</code> is therefore removed,

1386

and <code class="computeroutput">t2</code> renamed to

1387

<code class="computeroutput">t0</code>. Assuming

1388

<code class="computeroutput">t0</code> is allocated to a host

1389

1390

<code class="computeroutput">%EDX</code> will exist in a host

1391

CPU register for more than one simulated x86 instruction,

1392

which seems to me to be a highly desirable property.</p>

1393

<p>There is some mucking around to do with subregisters;

1394

<code class="computeroutput">%AL</code> vs

1395

1396

<code class="computeroutput">%AX</code> vs

1397

<code class="computeroutput">%EAX</code> etc. I can't remember

1398

how it works, but in general we are very conservative, and

1399

these tend to invalidate the caching.</p>

1400

</li>

1401

<li>

1402

<p>Redundant <code class="computeroutput">PUT</code>

1403

elimination. This annuls

1404

<code class="computeroutput">PUT</code>s of values back to

1405

simulated CPU registers if a later

1406

<code class="computeroutput">PUT</code> would overwrite the

1407

earlier <code class="computeroutput">PUT</code> value, and there

1408

is no intervening reads of the simulated register

1409

(<code class="computeroutput">ArchReg</code>).</p>

1410

<p>As before, we are paranoid when faced with subregister

1411

references. Also, <code class="computeroutput">PUT</code>s of

1412

<code class="computeroutput">%ESP</code> are never annulled,

1413

because it is vital the instrumenter always has an up-to-date

1414

<code class="computeroutput">%ESP</code> value available,

1415

<code class="computeroutput">%ESP</code> changes affect

1416

addressibility of the memory around the simulated stack

1417

pointer.</p>

1418

<p>The implication of the above paragraph is that the

1419

simulated machine's registers are only lazily updated once

1420

the above two optimisation phases have run, with the

1421

exception of <code class="computeroutput">%ESP</code>.

1422

<code class="computeroutput">TempReg</code>s go dead at the end

1423

of every basic block, from which is is inferrable that any

1424

<code class="computeroutput">TempReg</code> caching a simulated

1425

CPU reg is flushed (back into the relevant

1426

<code class="computeroutput">VG_(baseBlock)</code> slot) at the

1427

end of every basic block. The further implication is that

1428

the simulated registers are only up-to-date at in between

1429

basic blocks, and not at arbitrary points inside basic

1430

blocks. And the consequence of that is that we can only

1431

deliver signals to the client in between basic blocks. None

1432

of this seems any problem in practice.</p>

1433

</li>

1434

<li><p>Finally there is a simple def-use thing for condition

1435

codes. If an earlier uinstr writes the condition codes, and

1436

the next uinsn along which actually cares about the condition

1437

codes writes the same or larger set of them, but does not

1438

read any, the earlier uinsn is marked as not writing any

1439

condition codes. This saves a lot of redundant cond-code

1440

saving and restoring.</p></li>

1441

</ul></div>

1442

<p>The effect of these transformations on our short block is

1443

rather unexciting, and shown below. On longer basic blocks they

1444

can dramatically improve code quality.</p>

1445

1446

at 3: delete GET, rename t2 to t0 in (4 .. 6)

1447

at 7: delete GET, rename t6 to t0 in (8 .. 9)

1448

at 1: annul flag write OSZAP due to later OSZACP

1449

1450

Improved code:

1451

0: GETL %EDX, t0

1452

1: INCL t0

1453

2: PUTL t0, %EDX

1454

4: LDB (t0), t0

1455

5: WIDENL_Bs t0

1456

6: PUTL t0, %EAX

1457

8: GETL %ECX, t8

1458

9: LEA2L 1(t8,t0,2), t4

1459

10: LDB (t4), t10

1460

11: MOVB $0x20, t12

1461

12: ANDB t12, t10 (-wOSZACP)

1462

13: INCEIPo $9

1463

14: Jnzo $0x40435A50 (-rOSZACP)

1464

15: JMPo $0x40435A5B</pre>

1465

</div>

1466

1467

1468

<a name="mc-tech-docs.instrum"></a>1.2.8.�UCode instrumentation</h3></div></div></div>

1469

<p>Once you understand the meaning of the instrumentation

1470

uinstrs, discussed in detail above, the instrumentation scheme is

1471

fairly straightforward. Each uinstr is instrumented in

1472

isolation, and the instrumentation uinstrs are placed before the

1473

original uinstr. Our running example continues below. I have

1474

placed a blank line after every original ucode, to make it easier

1475

to see which instrumentation uinstrs correspond to which

1476

originals.</p>

1477

<p>As mentioned somewhere above,

1478

<code class="computeroutput">TempReg</code>s carrying values have

1479

names like <code class="computeroutput">t28</code>, and each one has

1480

a shadow carrying its V bits, with names like

1481

<code class="computeroutput">q28</code>. This pairing aids in

1482

reading instrumented ucode.</p>

1483

<p>One decision about all this is where to have "observation

1484

points", that is, where to check that V bits are valid. I use a

1485

minimalistic scheme, only checking where a failure of validity

1486

could cause the original program to (seg)fault. So the use of

1487

values as memory addresses causes a check, as do conditional

1488

jumps (these cause a check on the definedness of the condition

1489

codes). And arguments <code class="computeroutput">PUSH</code>ed

1490

for helper calls are checked, hence the weird restrictions on

1491

help call preambles described above.</p>

1492

<p>Another decision is that once a value is tested, it is

1493

thereafter regarded as defined, so that we do not emit multiple

1494

undefined-value errors for the same undefined value. That means

1495

that <code class="computeroutput">TESTV</code> uinstrs are always

1496

followed by <code class="computeroutput">SETV</code> on the same

1497

(shadow) <code class="computeroutput">TempReg</code>s. Most of

1498

these <code class="computeroutput">SETV</code>s are redundant and

1499

are removed by the post-instrumentation cleanup phase.</p>

1500

<p>The instrumentation for calling helper functions deserves

1501

further comment. The definedness of results from a helper is

1502

modelled using just one V bit. So, in short, we do pessimising

1503

casts of the definedness of all the args, down to a single bit,

1504

and then <code class="computeroutput">UifU</code> these bits

1505

together. So this single V bit will say "undefined" if any part

1506

of any arg is undefined. This V bit is then pessimally cast back

1507

up to the result(s) sizes, as needed. If, by seeing that all the

1508

args are got rid of with <code class="computeroutput">CLEAR</code>

1509

and none with <code class="computeroutput">POP</code>, Valgrind sees

1510

that the result of the call is not actually used, it immediately

1511

examines the result V bit with a

1512

<code class="computeroutput">TESTV</code> --

1513

<code class="computeroutput">SETV</code> pair. If it did not do

1514

this, there would be no observation point to detect that the some

1515

of the args to the helper were undefined. Of course, if the

1516

helper's results are indeed used, we don't do this, since the

1517

result usage will presumably cause the result definedness to be

1518

checked at some suitable future point.</p>

1519

<p>In general Valgrind tries to track definedness on a

1520

bit-for-bit basis, but as the above para shows, for calls to

1521

helpers we throw in the towel and approximate down to a single

1522

bit. This is because it's too complex and difficult to track

1523

bit-level definedness through complex ops such as integer

1524

multiply and divide, and in any case there is no reasonable code

1525

fragments which attempt to (eg) multiply two partially-defined

1526

values and end up with something meaningful, so there seems

1527

little point in modelling multiplies, divides, etc, in that level

1528

of detail.</p>

1529

<p>Integer loads and stores are instrumented with firstly a

1530

test of the definedness of the address, followed by a

1531

<code class="computeroutput">LOADV</code> or

1532

<code class="computeroutput">STOREV</code> respectively. These turn

1533

into calls to (for example)

1534

<code class="computeroutput">VG_(helperc_LOADV4)</code>. These

1535

helpers do two things: they perform an address-valid check, and

1536

they load or store V bits from/to the relevant address in the

1537

(simulated V-bit) memory.</p>

1538

<p>FPU loads and stores are different. As above the

1539

definedness of the address is first tested. However, the helper

1540

routine for FPU loads

1541

(<code class="computeroutput">VGM_(fpu_read_check)</code>) emits an

1542

error if either the address is invalid or the referenced area

1543

contains undefined values. It has to do this because we do not

1544

simulate the FPU at all, and so cannot track definedness of

1545

values loaded into it from memory, so we have to check them as

1546

soon as they are loaded into the FPU, ie, at this point. We

1547

notionally assume that everything in the FPU is defined.</p>

1548

<p>It follows therefore that FPU writes first check the

1549

definedness of the address, then the validity of the address, and

1550

finally mark the written bytes as well-defined.</p>

1551

<p>If anyone is inspired to extend Valgrind to MMX/SSE insns,

1552

I suggest you use the same trick. It works provided that the

1553

FPU/MMX unit is not used to merely as a conduit to copy partially

1554

undefined data from one place in memory to another.

1555

Unfortunately the integer CPU is used like that (when copying C

1556

structs with holes, for example) and this is the cause of much of

1557

the elaborateness of the instrumentation here described.</p>

1558

<p><code class="computeroutput">vg_instrument()</code> in

1559

<code class="filename">vg_translate.c</code> actually does the

1560

instrumentation. There are comments explaining how each uinstr

1561

is handled, so we do not repeat that here. As explained already,

1562

it is bit-accurate, except for calls to helper functions.

1563

Unfortunately the x86 insns

1564

<code class="computeroutput">bt/bts/btc/btr</code> are done by

1565

helper fns, so bit-level accuracy is lost there. This should be

1566

fixed by doing them inline; it will probably require adding a

1567

couple new uinstrs. Also, left and right rotates through the

1568

carry flag (x86 <code class="computeroutput">rcl</code> and

1569

<code class="computeroutput">rcr</code>) are approximated via a

1570

single V bit; so far this has not caused anyone to complain. The

1571

non-carry rotates, <code class="computeroutput">rol</code> and

1572

<code class="computeroutput">ror</code>, are much more common and

1573

are done exactly. Re-visiting the instrumentation for AND and

1574

OR, they seem rather verbose, and I wonder if it could be done

1575

more concisely now.</p>

1576

<p>The lowercase <code class="computeroutput">o</code> on many of

1577

the uopcodes in the running example indicates that the size field

1578

is zero, usually meaning a single-bit operation.</p>

1579

<p>Anyroads, the post-instrumented version of our running

1580

example looks like this:</p>

1581

1582

Instrumented code:

1583

0: GETVL %EDX, q0

1584

1: GETL %EDX, t0

1585

1586

2: TAG1o q0 = Left4 ( q0 )

1587

3: INCL t0

1588

1589

4: PUTVL q0, %EDX

1590

5: PUTL t0, %EDX

1591

1592

6: TESTVL q0

1593

7: SETVL q0

1594

8: LOADVB (t0), q0

1595

9: LDB (t0), t0

1596

1597

10: TAG1o q0 = SWiden14 ( q0 )

1598

11: WIDENL_Bs t0

1599

1600

12: PUTVL q0, %EAX

1601

13: PUTL t0, %EAX

1602

1603

14: GETVL %ECX, q8

1604

15: GETL %ECX, t8

1605

1606

16: MOVL q0, q4

1607

17: SHLL $0x1, q4

1608

18: TAG2o q4 = UifU4 ( q8, q4 )

1609

19: TAG1o q4 = Left4 ( q4 )

1610

20: LEA2L 1(t8,t0,2), t4

1611

1612

21: TESTVL q4

1613

22: SETVL q4

1614

23: LOADVB (t4), q10

1615

24: LDB (t4), t10

1616

1617

25: SETVB q12

1618

26: MOVB $0x20, t12

1619

1620

27: MOVL q10, q14

1621

28: TAG2o q14 = ImproveAND1_TQ ( t10, q14 )

1622

29: TAG2o q10 = UifU1 ( q12, q10 )

1623

30: TAG2o q10 = DifD1 ( q14, q10 )

1624

31: MOVL q12, q14

1625

32: TAG2o q14 = ImproveAND1_TQ ( t12, q14 )

1626

33: TAG2o q10 = DifD1 ( q14, q10 )

1627

34: MOVL q10, q16

1628

35: TAG1o q16 = PCast10 ( q16 )

1629

36: PUTVFo q16

1630

37: ANDB t12, t10 (-wOSZACP)

1631

1632

38: INCEIPo $9

1633

1634

39: GETVFo q18

1635

40: TESTVo q18

1636

41: SETVo q18

1637

42: Jnzo $0x40435A50 (-rOSZACP)

1638

1639

43: JMPo $0x40435A5B</pre>

1640

</div>

1641

1642

1643

<a name="mc-tech-docs.cleanup"></a>1.2.9.�UCode post-instrumentation cleanup</h3></div></div></div>

1644

<p>This pass, coordinated by

1645

<code class="computeroutput">vg_cleanup()</code>, removes redundant

1646

definedness computation created by the simplistic instrumentation

1647

pass. It consists of two passes,

1648

<code class="computeroutput">vg_propagate_definedness()</code>

1649

followed by

1650

<code class="computeroutput">vg_delete_redundant_SETVs</code>.</p>

1651

<p><code class="computeroutput">vg_propagate_definedness()</code>

1652

is a simple constant-propagation and constant-folding pass. It

1653

tries to determine which

1654

<code class="computeroutput">TempReg</code>s containing V bits will

1655

always indicate "fully defined", and it propagates this

1656

information as far as it can, and folds out as many operations as

1657

possible. For example, the instrumentation for an ADD of a

1658

literal to a variable quantity will be reduced down so that the

1659

definedness of the result is simply the definedness of the

1660

variable quantity, since the literal is by definition fully

1661

defined.</p>

1662

<p><code class="computeroutput">vg_delete_redundant_SETVs</code>

1663

removes <code class="computeroutput">SETV</code>s on shadow

1664

<code class="computeroutput">TempReg</code>s for which the next

1665

action is a write. I don't think there's anything else worth

1666

saying about this; it is simple. Read the sources for

1667

details.</p>

1668

<p>So the cleaned-up running example looks like this. As

1669

above, I have inserted line breaks after every original

1670

(non-instrumentation) uinstr to aid readability. As with

1671

straightforward ucode optimisation, the results in this block are

1672

undramatic because it is so short; longer blocks benefit more

1673

because they have more redundancy which gets eliminated.</p>

1674

1675

at 29: delete UifU1 due to defd arg1

1676

at 32: change ImproveAND1_TQ to MOV due to defd arg2

1677

at 41: delete SETV

1678

at 31: delete MOV

1679

at 25: delete SETV

1680

at 22: delete SETV

1681

at 7: delete SETV

1682

1683

0: GETVL %EDX, q0

1684

1: GETL %EDX, t0

1685

1686

2: TAG1o q0 = Left4 ( q0 )

1687

3: INCL t0

1688

1689

4: PUTVL q0, %EDX

1690

5: PUTL t0, %EDX

1691

1692

6: TESTVL q0

1693

8: LOADVB (t0), q0

1694

9: LDB (t0), t0

1695

1696

10: TAG1o q0 = SWiden14 ( q0 )

1697

11: WIDENL_Bs t0

1698

1699

12: PUTVL q0, %EAX

1700

13: PUTL t0, %EAX

1701

1702

14: GETVL %ECX, q8

1703

15: GETL %ECX, t8

1704

1705

16: MOVL q0, q4

1706

17: SHLL $0x1, q4

1707

18: TAG2o q4 = UifU4 ( q8, q4 )

1708

19: TAG1o q4 = Left4 ( q4 )

1709

20: LEA2L 1(t8,t0,2), t4

1710

1711

21: TESTVL q4

1712

23: LOADVB (t4), q10

1713

24: LDB (t4), t10

1714

1715

26: MOVB $0x20, t12

1716

1717

27: MOVL q10, q14

1718

28: TAG2o q14 = ImproveAND1_TQ ( t10, q14 )

1719

30: TAG2o q10 = DifD1 ( q14, q10 )

1720

32: MOVL t12, q14

1721

33: TAG2o q10 = DifD1 ( q14, q10 )

1722

34: MOVL q10, q16

1723

35: TAG1o q16 = PCast10 ( q16 )

1724

36: PUTVFo q16

1725

37: ANDB t12, t10 (-wOSZACP)

1726

1727

38: INCEIPo $9

1728

39: GETVFo q18

1729

40: TESTVo q18

1730

42: Jnzo $0x40435A50 (-rOSZACP)

1731

1732

43: JMPo $0x40435A5B</pre>

1733

</div>

1734

1735

1736

<a name="mc-tech-docs.transfrom"></a>1.2.10.�Translation from UCode</h3></div></div></div>

1737

<p>This is all very simple, even though

1738

<code class="filename">vg_from_ucode.c</code> is a big file.

1739

Position-independent x86 code is generated into a dynamically

1740

allocated array <code class="computeroutput">emitted_code</code>;

1741

this is doubled in size when it overflows. Eventually the array

1742

is handed back to the caller of

1743

<code class="computeroutput">VG_(translate)</code>, who must copy

1744

the result into TC and TT, and free the array.</p>

1745

<p>This file is structured into four layers of abstraction,

1746

which, thankfully, are glued back together with extensive

1747

<code class="computeroutput">__inline__</code> directives. From the

1748

bottom upwards:</p>

1749

1750

<li><p>Address-mode emitters,

1751

<code class="computeroutput">emit_amode_regmem_reg</code> et

1752

al.</p></li>

1753

<li><p>Emitters for specific x86 instructions. There are

1754

quite a lot of these, with names such as

1755

<code class="computeroutput">emit_movv_offregmem_reg</code>.

1756

The <code class="computeroutput">v</code> suffix is Intel

1757

parlance for a 16/32 bit insn; there are also

1758

<code class="computeroutput">b</code> suffixes for 8 bit

1759

insns.</p></li>

1760

<li><p>The next level up are the

1761

<code class="computeroutput">synth_*</code> functions, which

1762

synthesise possibly a sequence of raw x86 instructions to do

1763

some simple task. Some of these are quite complex because

1764

they have to work around Intel's silly restrictions on

1765

subregister naming. See

1766

<code class="computeroutput">synth_nonshiftop_reg_reg</code> for

1767

example.</p></li>

1768

<li><p>Finally, at the top of the heap, we have

1769

<code class="computeroutput">emitUInstr()</code>, which emits

1770

code for a single uinstr.</p></li>

1771

</ul></div>

1772

<p>Some comments:</p>

1773

1774

<li>

1775

<p>The hack for FPU instructions becomes apparent here.

1776

To do a <code class="computeroutput">FPU</code> ucode

1777

instruction, we load the simulated FPU's state into from its

1778

<code class="computeroutput">VG_(baseBlock)</code> into the real

1779

FPU using an x86 <code class="computeroutput">frstor</code>

1780

insn, do the ucode <code class="computeroutput">FPU</code> insn

1781

on the real CPU, and write the updated FPU state back into

1782

<code class="computeroutput">VG_(baseBlock)</code> using an

1783

<code class="computeroutput">fnsave</code> instruction. This is

1784

pretty brutal, but is simple and it works, and even seems

1785

tolerably efficient. There is no attempt to cache the

1786

simulated FPU state in the real FPU over multiple

1787

back-to-back ucode FPU instructions.</p>

1788

<p><code class="computeroutput">FPU_R</code> and

1789

<code class="computeroutput">FPU_W</code> are also done this

1790

way, with the minor complication that we need to patch in

1791

some addressing mode bits so the resulting insn knows the

1792

effective address to use. This is easy because of the

1793

regularity of the x86 FPU instruction encodings.</p>

1794

</li>

1795

<li><p>An analogous trick is done with ucode insns which

1796

claim, in their <code class="computeroutput">flags_r</code> and

1797

<code class="computeroutput">flags_w</code> fields, that they

1798

read or write the simulated

1799

<code class="computeroutput">%EFLAGS</code>. For such cases we

1800

first copy the simulated

1801

<code class="computeroutput">%EFLAGS</code> into the real

1802

<code class="computeroutput">%eflags</code>, then do the insn,

1803

then, if the insn says it writes the flags, copy back to

1804

<code class="computeroutput">%EFLAGS</code>. This is a bit

1805

expensive, which is why the ucode optimisation pass goes to

1806

some effort to remove redundant flag-update annotations.</p></li>

1807

</ul></div>

1808

<p>And so ... that's the end of the documentation for the

1809

instrumentating translator! It's really not that complex,

1810

because it's composed as a sequence of simple(ish) self-contained

1811

transformations on straight-line blocks of code.</p>

1812

</div>

1813

1814

1815

<a name="mc-tech-docs.dispatch"></a>1.2.11.�Top-level dispatch loop</h3></div></div></div>

1816

<p>Urk. In <code class="computeroutput">VG_(toploop)</code>.

1817

This is basically boring and unsurprising, not to mention fiddly

1818

and fragile. It needs to be cleaned up.</p>

1819

<p>The only perhaps surprise is that the whole thing is run on

1820

top of a <code class="computeroutput">setjmp</code>-installed

1821

exception handler, because, supposing a translation got a

1822

segfault, we have to bail out of the Valgrind-supplied exception

1823

handler <code class="computeroutput">VG_(oursignalhandler)</code>

1824

and immediately start running the client's segfault handler, if

1825

it has one. In particular we can't finish the current basic

1826

block and then deliver the signal at some convenient future

1827

point, because signals like SIGILL, SIGSEGV and SIGBUS mean that

1828

the faulting insn should not simply be re-tried. (I'm sure there

1829

is a clearer way to explain this).</p>

1830

</div>

1831

1832

1833

<a name="mc-tech-docs.lazy"></a>1.2.12.�Lazy updates of the simulated program counter</h3></div></div></div>

1834

<p>Simulated <code class="computeroutput">%EIP</code> is not

1835

updated after every simulated x86 insn as this was regarded as

1836

too expensive. Instead ucode

1837

<code class="computeroutput">INCEIP</code> insns move it along as

1838

and when necessary. Currently we don't allow it to fall more

1839

than 4 bytes behind reality (see

1840

<code class="computeroutput">VG_(disBB)</code> for the way this

1841

works).</p>

1842

<p>Note that <code class="computeroutput">%EIP</code> is always

1843

brought up to date by the inner dispatch loop in

1844

<code class="computeroutput">VG_(dispatch)</code>, so that if the

1845

client takes a fault we know at least which basic block this

1846

happened in.</p>

1847

</div>

1848

1849

1850

<a name="mc-tech-docs.signals"></a>1.2.13.�Signals</h3></div></div></div>

1851

<p>Horrible, horrible. <code class="filename">vg_signals.c</code>.

1852

Basically, since we have to intercept all system calls anyway, we

1853

can see when the client tries to install a signal handler. If it

1854

does so, we make a note of what the client asked to happen, and

1855

ask the kernel to route the signal to our own signal handler,

1856

<code class="computeroutput">VG_(oursignalhandler)</code>. This

1857

simply notes the delivery of signals, and returns.</p>

1858

<p>Every 1000 basic blocks, we see if more signals have

1859

arrived. If so,

1860

<code class="computeroutput">VG_(deliver_signals)</code> builds

1861

signal delivery frames on the client's stack, and allows their

1862

handlers to be run. Valgrind places in these signal delivery

1863

frames a bogus return address,

1864

<code class="computeroutput">VG_(signalreturn_bogusRA)</code>, and

1865

checks all jumps to see if any jump to it. If so, this is a sign

1866

that a signal handler is returning, and if so Valgrind removes

1867

the relevant signal frame from the client's stack, restores the

1868

from the signal frame the simulated state before the signal was

1869

delivered, and allows the client to run onwards. We have to do

1870

it this way because some signal handlers never return, they just

1871

<code class="computeroutput">longjmp()</code>, which nukes the

1872

signal delivery frame.</p>

1873

<p>The Linux kernel has a different but equally horrible hack

1874

for detecting signal handler returns. Discovering it is left as

1875

an exercise for the reader.</p>

1876

</div>

1877

1878

1879

<a name="mc-tech-docs.todo"></a>1.2.14.�To be written</h3></div></div></div>

1880

<p>The following is a list of as-yet-not-written stuff. Apologies.</p>

1881

1882

<li><p>The translation cache and translation table</p></li>

1883

<li><p>Exceptions, creating new translations</p></li>

1884

<li><p>Self-modifying code</p></li>

1885

<li><p>Errors, error contexts, error reporting, suppressions</p></li>

1886

<li><p>Client malloc/free</p></li>

1887

<li><p>Low-level memory management</p></li>

1888

<li><p>A and V bitmaps</p></li>

1889

<li><p>Symbol table management</p></li>

1890

<li><p>Dealing with system calls</p></li>

1891

<li><p>Namespace management</p></li>

1892

<li><p>GDB attaching</p></li>

1893

<li><p>Non-dependence on glibc or anything else</p></li>

1894

<li><p>The leak detector</p></li>

1895

<li><p>Performance problems</p></li>

1896

<li><p>Continuous sanity checking</p></li>

1897

<li><p>Tracing, or not tracing, child processes</p></li>

1898

<li><p>Assembly glue for syscalls</p></li>

1899

</ol></div>

1900

</div>

1901

</div>

1902

1903

1904

<a name="mc-tech-docs.extensions"></a>1.3.�Extensions</h2></div></div></div>

1905

<p>Some comments about Stuff To Do.</p>

1906

1907

1908

1909

<p>Stephan Kulow and Marc Mutz report problems with kmail in

1910

KDE 3 CVS (RC2 ish) when run on Valgrind. Stephan has it

1911

deadlocking; Marc has it looping at startup. I can't repro

1912

either behaviour. Needs repro-ing and fixing.</p>

1913

</div>

1914

1915

1916

<a name="mc-tech-docs.threads"></a>1.3.2.�Threads</h3></div></div></div>

1917

<p>Doing a good job of thread support strikes me as almost a

1918

research-level problem. The central issues are how to do fast

1919

cheap locking of the

1920

<code class="computeroutput">VG_(primary_map)</code> structure,

1921

whether or not accesses to the individual secondary maps need

1922

locking, what race-condition issues result, and whether the

1923

already-nasty mess that is the signal simulator needs further

1924

hackery.</p>

1925

<p>I realise that threads are the most-frequently-requested

1926

feature, and I am thinking about it all. If you have guru-level

1927

understanding of fast mutual exclusion mechanisms and race

1928

conditions, I would be interested in hearing from you.</p>

1929

</div>

1930

1931

1932

<a name="mc-tech-docs.verify"></a>1.3.3.�Verification suite</h3></div></div></div>

1933

<p>Directory <code class="computeroutput">tests/</code> contains

1934

various ad-hoc tests for Valgrind. However, there is no

1935

systematic verification or regression suite, that, for example,

1936

exercises all the stuff in <code class="filename">vg_memory.c</code>, to

1937

ensure that illegal memory accesses and undefined value uses are

1938

detected as they should be. It would be good to have such a

1939

suite.</p>

1940

</div>

1941

1942

1943

<a name="mc-tech-docs.porting"></a>1.3.4.�Porting to other platforms</h3></div></div></div>

1944

<p>It would be great if Valgrind was ported to FreeBSD and x86

1945

NetBSD, and to x86 OpenBSD, if it's possible (doesn't OpenBSD use

1946

a.out-style executables, not ELF ?)</p>

1947

<p>The main difficulties, for an x86-ELF platform, seem to

1948

be:</p>

1949

1950

<li><p>You'd need to rewrite the

1951

<code class="computeroutput">/proc/self/maps</code> parser

1952

(<code class="filename">vg_procselfmaps.c</code>). Easy.</p></li>

1953

<li><p>You'd need to rewrite

1954

<code class="filename">vg_syscall_mem.c</code>, or, more specifically,

1955

provide one for your OS. This is tedious, but you can

1956

implement syscalls on demand, and the Linux kernel interface

1957

is, for the most part, going to look very similar to the *BSD

1958

interfaces, so it's really a copy-paste-and-modify-on-demand

1959

job. As part of this, you'd need to supply a new

1960

<code class="filename">vg_kerneliface.h</code> file.</p></li>

1961

<li><p>You'd also need to change the syscall wrappers for

1962

Valgrind's internal use, in

1963

<code class="filename">vg_mylibc.c</code>.</p></li>

1964

</ul></div>

1965

<p>All in all, I think a port to x86-ELF *BSDs is not really

1966

very difficult, and in some ways I would like to see it happen,

1967

because that would force a more clear factoring of Valgrind into

1968

platform dependent and independent pieces. Not to mention, *BSD

1969

folks also deserve to use Valgrind just as much as the Linux crew

1970

do.</p>

1971

</div>

1972

</div>

1973

1974

1975

<a name="mc-tech-docs.easystuff"></a>1.4.�Easy stuff which ought to be done</h2></div></div></div>

1976

1977

1978

<a name="mc-tech-docs.mmx"></a>1.4.1.�MMX Instructions</h3></div></div></div>

1979

<p>MMX insns should be supported, using the same trick as for

1980

FPU insns. If the MMX registers are not used to copy

1981

uninitialised junk from one place to another in memory, this

1982

means we don't have to actually simulate the internal MMX unit

1983

state, so the FPU hack applies. This should be fairly

1984

easy.</p>

1985

</div>

1986

1987

1988

<a name="mc-tech-docs.fixstabs"></a>1.4.2.�Fix stabs-info reader</h3></div></div></div>

1989

<p>The machinery in <code class="filename">vg_symtab2.c</code> which

1990

reads "stabs" style debugging info is pretty weak. It usually

1991

correctly translates simulated program counter values into line

1992

numbers and procedure names, but the file name is often

1993

completely wrong. I think the logic used to parse "stabs"

1994

entries is weak. It should be fixed. The simplest solution,

1995

IMO, is to copy either the logic or simply the code out of GNU

1996

binutils which does this; since GDB can clearly get it right,

1997

binutils (or GDB?) must have code to do this somewhere.</p>

1998

</div>

1999

2000

2001

2002

<p>These are x86 instructions which test, complement, set, or

2003

reset, a single bit in a word. At the moment they are both

2004

incorrectly implemented and incorrectly instrumented.</p>

2005

<p>The incorrect instrumentation is due to use of helper

2006

functions. This means we lose bit-level definedness tracking,

2007

which could wind up giving spurious uninitialised-value use

2008

errors. The Right Thing to do is to invent a couple of new

2009

UOpcodes, I think <code class="computeroutput">GET_BIT</code> and

2010

<code class="computeroutput">SET_BIT</code>, which can be used to

2011

implement all 4 x86 insns, get rid of the helpers, and give

2012

bit-accurate instrumentation rules for the two new

2013

UOpcodes.</p>

2014

<p>I realised the other day that they are mis-implemented too.

2015

The x86 insns take a bit-index and a register or memory location

2016

to access. For registers the bit index clearly can only be in

2017

the range zero to register-width minus 1, and I assumed the same

2018

applied to memory locations too. But evidently not; for memory

2019

locations the index can be arbitrary, and the processor will

2020

index arbitrarily into memory as a result. This too should be

2021

fixed. Sigh. Presumably indexing outside the immediate word is

2022

not actually used by any programs yet tested on Valgrind, for

2023

otherwise they (presumably) would simply not work at all. If you

2024

plan to hack on this, first check the Intel docs to make sure my

2025

understanding is really correct.</p>

2026

</div>

2027

2028

2029

<a name="mc-tech-docs.prefetch"></a>1.4.4.�Using PREFETCH Instructions</h3></div></div></div>

2030

<p>Here's a small but potentially interesting project for

2031

performance junkies. Experiments with valgrind's code generator

2032

and optimiser(s) suggest that reducing the number of instructions

2033

executed in the translations and mem-check helpers gives

2034

disappointingly small performance improvements. Perhaps this is

2035

because performance of Valgrindified code is limited by cache

2036

misses. After all, each read in the original program now gives

2037

rise to at least three reads, one for the

2038

<code class="computeroutput">VG_(primary_map)</code>, one of the

2039

resulting secondary, and the original. Not to mention, the

2040

instrumented translations are 13 to 14 times larger than the

2041

originals. All in all one would expect the memory system to be

2042

hammered to hell and then some.</p>

2043

<p>So here's an idea. An x86 insn involving a read from

2044

memory, after instrumentation, will turn into ucode of the

2045

following form:</p>

2046

2047

... calculate effective addr, into ta and qa ...

2048

TESTVL qa -- is the addr defined?

2049

LOADV (ta), qloaded -- fetch V bits for the addr

2050

LOAD (ta), tloaded -- do the original load</pre>

2051

<p>At the point where the

2052

<code class="computeroutput">LOADV</code> is done, we know the

2053

actual address (<code class="computeroutput">ta</code>) from which

2054

the real <code class="computeroutput">LOAD</code> will be done. We

2055

also know that the <code class="computeroutput">LOADV</code> will

2056

take around 20 x86 insns to do. So it seems plausible that doing

2057

a prefetch of <code class="computeroutput">ta</code> just before the

2058

<code class="computeroutput">LOADV</code> might just avoid a miss at

2059

the <code class="computeroutput">LOAD</code> point, and that might

2060

be a significant performance win.</p>

2061

<p>Prefetch insns are notoriously tempermental, more often

2062

than not making things worse rather than better, so this would

2063

require considerable fiddling around. It's complicated because

2064

Intels and AMDs have different prefetch insns with different

2065

semantics, so that too needs to be taken into account. As a

2066

general rule, even placing the prefetches before the

2067

<code class="computeroutput">LOADV</code> insn is too near the

2068

<code class="computeroutput">LOAD</code>; the ideal distance is

2069

apparently circa 200 CPU cycles. So it might be worth having

2070

another analysis/transformation pass which pushes prefetches as

2071

far back as possible, hopefully immediately after the effective

2072

address becomes available.</p>

2073

<p>Doing too many prefetches is also bad because they soak up

2074

bus bandwidth / cpu resources, so some cleverness in deciding

2075

which loads to prefetch and which to not might be helpful. One

2076

can imagine not prefetching client-stack-relative

2077

(<code class="computeroutput">%EBP</code> or

2078

<code class="computeroutput">%ESP</code>) accesses, since the stack

2079

in general tends to show good locality anyway.</p>

2080

<p>There's quite a lot of experimentation to do here, but I

2081

think it might make an interesting week's work for

2082

someone.</p>

2083

<p>As of 15-ish March 2002, I've started to experiment with

2084

this, using the AMD

2085

<code class="computeroutput">prefetch/prefetchw</code> insns.</p>

2086

</div>

2087

2088

2089

<a name="mc-tech-docs.pranges"></a>1.4.5.�User-defined Permission Ranges</h3></div></div></div>

2090

<p>This is quite a large project -- perhaps a month's hacking

2091

for a capable hacker to do a good job -- but it's potentially

2092

very interesting. The outcome would be that Valgrind could

2093

detect a whole class of bugs which it currently cannot.</p>

2094

<p>The presentation falls into two pieces.</p>

2095

2096

2097

<a name="mc-tech-docs.psetting"></a>1.4.5.1.�Part 1: User-defined Address-range Permission Setting</h4></div></div></div>

2098

<p>Valgrind intercepts the client's

2099

<code class="computeroutput">malloc</code>,

2100

<code class="computeroutput">free</code>, etc calls, watches system

2101

calls, and watches the stack pointer move. This is currently the

2102

only way it knows about which addresses are valid and which not.

2103

Sometimes the client program knows extra information about its

2104

memory areas. For example, the client could at some point know

2105

that all elements of an array are out-of-date. We would like to

2106

be able to convey to Valgrind this information that the array is

2107

now addressable-but-uninitialised, so that Valgrind can then warn

2108

if elements are used before they get new values.</p>

2109

<p>What I would like are some macros like this:</p>

2110

2111

VALGRIND_MAKE_NOACCESS(addr, len)

2112

VALGRIND_MAKE_WRITABLE(addr, len)

2113

VALGRIND_MAKE_READABLE(addr, len)</pre>

2114

<p>and also, to check that memory is

2115

addressible/initialised,</p>

2116

2117

VALGRIND_CHECK_ADDRESSIBLE(addr, len)

2118

VALGRIND_CHECK_INITIALISED(addr, len)</pre>

2119

<p>I then include in my sources a header defining these

2120

macros, rebuild my app, run under Valgrind, and get user-defined

2121

checks.</p>

2122

<p>Now here's a neat trick. It's a nuisance to have to

2123

re-link the app with some new library which implements the above

2124

macros. So the idea is to define the macros so that the

2125

resulting executable is still completely stand-alone, and can be

2126

run without Valgrind, in which case the macros do nothing, but

2127

when run on Valgrind, the Right Thing happens. How to do this?

2128

The idea is for these macros to turn into a piece of inline

2129

assembly code, which (1) has no effect when run on the real CPU,

2130

(2) is easily spotted by Valgrind's JITter, and (3) no sane

2131

person would ever write, which is important for avoiding false

2132

matches in (2). So here's a suggestion:</p>

2133

2134

VALGRIND_MAKE_NOACCESS(addr, len)</pre>

2135

<p>becomes (roughly speaking)</p>

2136

2137

movl addr, %eax

2138

movl len, %ebx

2139

movl $1, %ecx -- 1 describes the action; MAKE_WRITABLE might be

2140

-- 2, etc

2141

rorl $13, %ecx

2142

rorl $19, %ecx

2143

rorl $11, %eax

2144

rorl $21, %eax</pre>

2145

<p>The rotate sequences have no effect, and it's unlikely they

2146

would appear for any other reason, but they define a unique

2147

byte-sequence which the JITter can easily spot. Using the

2148

operand constraints section at the end of a gcc inline-assembly

2149

statement, we can tell gcc that the assembly fragment kills

2150

<code class="computeroutput">%eax</code>,

2151

<code class="computeroutput">%ebx</code>,

2152

<code class="computeroutput">%ecx</code> and the condition codes, so

2153

this fragment is made harmless when not running on Valgrind, runs

2154

quickly when not on Valgrind, and does not require any other

2155

library support.</p>

2156

</div>

2157

2158

2159

<a name="mc-tech-docs.prange-detect"></a>1.4.5.2.�Part 2: Using it to detect Interference between Stack

2160

Variables</h4></div></div></div>

2161

<p>Currently Valgrind cannot detect errors of the following

2162

form:</p>

2163

2164

void fooble ( void )

2165

{

2166

int a[10];

2167

int b[10];

2168

a[10] = 99;

2169

}</pre>

2170

<p>Now imagine rewriting this as</p>

2171

2172

void fooble ( void )

2173

{

2174

int spacer0;

2175

int a[10];

2176

int spacer1;

2177

int b[10];

2178

int spacer2;

2179

VALGRIND_MAKE_NOACCESS(&spacer0, sizeof(int));

2180

VALGRIND_MAKE_NOACCESS(&spacer1, sizeof(int));

2181

VALGRIND_MAKE_NOACCESS(&spacer2, sizeof(int));

2182

a[10] = 99;

2183

}</pre>

2184

<p>Now the invalid write is certain to hit

2185

<code class="computeroutput">spacer0</code> or

2186

<code class="computeroutput">spacer1</code>, so Valgrind will spot

2187

the error.</p>

2188

<p>There are two complications.</p>

2189

2190

<li><p>The first is that we don't want to annotate sources by

2191

hand, so the Right Thing to do is to write a C/C++ parser,

2192

annotator, prettyprinter which does this automatically, and

2193

run it on post-CPP'd C/C++ source. The parser/prettyprinter

2194

is probably not as hard as it sounds; I would write it in Haskell,

2195

a powerful functional language well suited to doing symbolic

2196

computation, with which I am intimately familar. There is

2197

already a C parser written in Haskell by someone in the

2198

Haskell community, and that would probably be a good starting

2199

point.</p></li>

2200

<li>

2201

<p>The second complication is how to get rid of these

2202

<code class="computeroutput">NOACCESS</code> records inside

2203

Valgrind when the instrumented function exits; after all,

2204

these refer to stack addresses and will make no sense

2205

whatever when some other function happens to re-use the same

2206

stack address range, probably shortly afterwards. I think I

2207

would be inclined to define a special stack-specific

2208

macro:</p>

2209

2210

VALGRIND_MAKE_NOACCESS_STACK(addr, len)</pre>

2211

<p>which causes Valgrind to record the client's

2212

<code class="computeroutput">%ESP</code> at the time it is

2213

executed. Valgrind will then watch for changes in

2214

<code class="computeroutput">%ESP</code> and discard such

2215

records as soon as the protected area is uncovered by an

2216

increase in <code class="computeroutput">%ESP</code>. I

2217

hesitate with this scheme only because it is potentially

2218

expensive, if there are hundreds of such records, and

2219

considering that changes in

2220

<code class="computeroutput">%ESP</code> already require

2221

expensive messing with stack access permissions.</p>

2222

</li>

2223

</ol></div>

2224

<p>This is probably easier and more robust than for the

2225

instrumenter program to try and spot all exit points for the

2226

procedure and place suitable deallocation annotations there.

2227

Plus C++ procedures can bomb out at any point if they get an

2228

exception, so spotting return points at the source level just

2229

won't work at all.</p>

2230

<p>Although some work, it's all eminently doable, and it would

2231

make Valgrind into an even-more-useful tool.</p>

2232

</div>

2233

</div>

2234

</div>

2235

</div>

2236

<div>

2237

2238

<tr>

2239

2240

2241

2242

<td rowspan="2" width="40%" align="right">�<a accesskey="n" href="cg-tech-docs.html">2.�How Cachegrind works�>></a>

2243

</td>

2244

</tr>

2245

2246

</table>

2247

</div>

2248

</body>

2249

</html>

Older »