~ubuntu-branches/debian/stretch/gcc-6-doc/stretch

Viewing changes to .pc/from-debian-gcc-rename-info-files.diff/gcc/doc/cppinternals.texi

Committer: Package Import Robot
Author(s): Guo Yixuan (郭溢譞)
Date: 2016-05-06 21:45:33 UTC
Revision ID: package-import@ubuntu.com-20160506214533-1ro717riyxkhd4kn

Tags: 6.1.0-1

http://bugs.debian.org/822667

* New upstream branch. (Closes: #822667)
* Synced patches with gcc-6, 6.1.1-1.
* Use https URIs for Vcs-*.
* Bumped standards version to 3.9.8, no changes needed.
* Update debian/copyright.

files added:
.pc

.pc/.quilt_patches

.pc/.quilt_series

.pc/.version

.pc/applied-patches

.pc/fix-direntry.diff

.pc/fix-direntry.diff/gcc

.pc/fix-direntry.diff/gcc/ada

.pc/fix-direntry.diff/gcc/ada/gnat-style.texi

.pc/fix-direntry.diff/gcc/ada/gnat_rm.texi

.pc/fix-direntry.diff/gcc/ada/gnat_ugn.texi

.pc/from-debian-gcc-alpha-ieee-doc.diff

.pc/from-debian-gcc-alpha-ieee-doc.diff/gcc

.pc/from-debian-gcc-alpha-ieee-doc.diff/gcc/doc

.pc/from-debian-gcc-alpha-ieee-doc.diff/gcc/doc/invoke.texi

.pc/from-debian-gcc-gcc-SOURCE_DATE_EPOCH-doc.diff

.pc/from-debian-gcc-gcc-SOURCE_DATE_EPOCH-doc.diff/gcc

.pc/from-debian-gcc-gcc-SOURCE_DATE_EPOCH-doc.diff/gcc/doc

.pc/from-debian-gcc-gcc-SOURCE_DATE_EPOCH-doc.diff/gcc/doc/cppenv.texi

.pc/from-debian-gcc-gdc-6-doc.diff

.pc/from-debian-gcc-gdc-6-doc.diff/gcc

.pc/from-debian-gcc-gdc-6-doc.diff/gcc/doc

.pc/from-debian-gcc-gdc-6-doc.diff/gcc/doc/frontends.texi

.pc/from-debian-gcc-gdc-6-doc.diff/gcc/doc/install.texi

.pc/from-debian-gcc-gdc-6-doc.diff/gcc/doc/invoke.texi

.pc/from-debian-gcc-gdc-6-doc.diff/gcc/doc/sourcebuild.texi

.pc/from-debian-gcc-gdc-6-doc.diff/gcc/doc/standards.texi

.pc/from-debian-gcc-rename-info-files.diff

.pc/from-debian-gcc-rename-info-files.diff/gcc

.pc/from-debian-gcc-rename-info-files.diff/gcc/ada

.pc/from-debian-gcc-rename-info-files.diff/gcc/ada/gnat-style.texi

.pc/from-debian-gcc-rename-info-files.diff/gcc/ada/gnat_rm.texi

.pc/from-debian-gcc-rename-info-files.diff/gcc/doc

.pc/from-debian-gcc-rename-info-files.diff/gcc/doc/cpp.texi

.pc/from-debian-gcc-rename-info-files.diff/gcc/doc/cppinternals.texi

.pc/from-debian-gcc-rename-info-files.diff/gcc/doc/extend.texi

.pc/from-debian-gcc-rename-info-files.diff/gcc/doc/gcc.texi

.pc/from-debian-gcc-rename-info-files.diff/gcc/doc/gccint.texi

.pc/from-debian-gcc-rename-info-files.diff/gcc/doc/install.texi

.pc/from-debian-gcc-rename-info-files.diff/gcc/doc/invoke.texi

.pc/from-debian-gcc-rename-info-files.diff/gcc/doc/libgcc.texi

.pc/from-debian-gcc-rename-info-files.diff/gcc/doc/makefile.texi

.pc/from-debian-gcc-rename-info-files.diff/gcc/doc/standards.texi

.pc/from-debian-gcc-rename-info-files.diff/gcc/fortran

.pc/from-debian-gcc-rename-info-files.diff/gcc/fortran/gfortran.texi

.pc/from-debian-gcc-rename-info-files.diff/gcc/go

.pc/from-debian-gcc-rename-info-files.diff/gcc/go/gccgo.texi

.pc/from-debian-gcc-rename-info-files.diff/gcc/java

.pc/from-debian-gcc-rename-info-files.diff/gcc/java/gcj.texi

.pc/from-debian-gcc-rename-info-files.diff/libgomp

.pc/from-debian-gcc-rename-info-files.diff/libgomp/libgomp.texi

.pc/from-debian-gcc-rename-info-files.diff/libitm

.pc/from-debian-gcc-rename-info-files.diff/libitm/libitm.texi

.pc/from-debian-gcc-rename-info-files.diff/libquadmath

.pc/from-debian-gcc-rename-info-files.diff/libquadmath/libquadmath.texi

.pc/gnat-cross-references.diff

.pc/gnat-cross-references.diff/gcc

.pc/gnat-cross-references.diff/gcc/ada

.pc/gnat-cross-references.diff/gcc/ada/gnat_rm.texi

.pc/gnat-cross-references.diff/gcc/ada/gnat_ugn.texi

.pc/gnat-cross-references.diff/gcc/doc

.pc/gnat-cross-references.diff/gcc/doc/invoke.texi

.pc/gnat-cross-references.diff/gcc/doc/standards.texi

.pc/libquadmath.diff

.pc/libquadmath.diff/libquadmath

.pc/libquadmath.diff/libquadmath/libquadmath.texi

debian

debian/Makefile

debian/README.source

debian/changelog

debian/check_gcc_patches

debian/compat

debian/control

debian/convert-debian-gcc-diff

debian/convert-debian-gcc-diff_cmdline-5

debian/copyright

debian/copyright.old

debian/cpp-BV-doc.doc-base.cpp

debian/cpp-BV-doc.doc-base.cppint

debian/cpp-BV-doc.docs

debian/cpp-BV-doc.info

debian/cpp-BV-doc.manpages

debian/extract-doc-tarball-from-upstream

debian/gbp.conf

debian/gcc-BV-doc.doc-base.gcc

debian/gcc-BV-doc.doc-base.gccinstall

debian/gcc-BV-doc.doc-base.gccint

debian/gcc-BV-doc.doc-base.gomp

debian/gcc-BV-doc.doc-base.itm

debian/gcc-BV-doc.doc-base.qmath

debian/gcc-BV-doc.docs

debian/gcc-BV-doc.info

debian/gcc-BV-doc.links

debian/gcc-BV-doc.manpages

debian/gcc-doc-base.manpages

debian/gccgo-BV-doc.doc-base

debian/gccgo-BV-doc.docs

debian/gccgo-BV-doc.info

debian/gccgo-BV-doc.manpages

debian/gcj-BV-doc.doc-base

debian/gcj-BV-doc.docs

debian/gcj-BV-doc.info

debian/gcj-BV-doc.manpages

debian/gfortran-BV-doc.doc-base

debian/gfortran-BV-doc.docs

debian/gfortran-BV-doc.info

debian/gfortran-BV-doc.manpages

debian/gnat-BV-doc.doc-base.rm

debian/gnat-BV-doc.doc-base.style

debian/gnat-BV-doc.doc-base.ugn

debian/gnat-BV-doc.docs

debian/gnat-BV-doc.info

debian/patches

debian/patches/fix-direntry.diff

debian/patches/from-debian-gcc-alpha-ieee-doc.diff

debian/patches/from-debian-gcc-gcc-SOURCE_DATE_EPOCH-doc.diff

debian/patches/from-debian-gcc-gdc-6-doc.diff

debian/patches/from-debian-gcc-rename-info-files.diff

debian/patches/gnat-cross-references.diff

debian/patches/libquadmath.diff

debian/patches/series

debian/rules

debian/source

debian/source/format

debian/watch

files modified:
gcc/ada/gnat-style.texi

gcc/ada/gnat_rm.texi

gcc/ada/gnat_ugn.texi

gcc/doc/cpp.texi

gcc/doc/cppenv.texi

gcc/doc/cppinternals.texi

gcc/doc/extend.texi

gcc/doc/frontends.texi

gcc/doc/gcc.texi

gcc/doc/gccint.texi

gcc/doc/install.texi

gcc/doc/invoke.texi

gcc/doc/libgcc.texi

gcc/doc/makefile.texi

gcc/doc/sourcebuild.texi

gcc/doc/standards.texi

gcc/fortran/gfortran.texi

gcc/go/gccgo.texi

gcc/java/gcj.texi

libgomp/libgomp.texi

libitm/libitm.texi

libquadmath/libquadmath.texi

Show diffs side-by-side

added added

removed removed

.pc/from-debian-gcc-rename-info-files.diff/gcc/doc/cppinternals.texi

\input texinfo

@setfilename cppinternals.info

@settitle The GNU C Preprocessor Internals

@include gcc-common.texi

@ifinfo

@dircategory Software development

@direntry

* Cpplib: (cppinternals). Cpplib internals.

@end direntry

@end ifinfo

@c @smallbook

@c @cropmarks

@c @finalout

@setchapternewpage odd

@ifinfo

This file documents the internals of the GNU C Preprocessor.

Permission is granted to make and distribute verbatim copies of

this manual provided the copyright notice and this permission notice

are preserved on all copies.

@ignore

Permission is granted to process this file through Tex and print the

results, provided the printed document carries copying permission

notice identical to this one except for the removal of this paragraph

(this paragraph not being relevant to the printed manual).

@end ignore

Permission is granted to copy and distribute modified versions of this

manual under the conditions for verbatim copying, provided also that

the entire resulting derived work is distributed under the terms of a

permission notice identical to this one.

Permission is granted to copy and distribute translations of this manual

into another language, under the above conditions for modified versions.

@end ifinfo

@titlepage

@title Cpplib Internals

@versionsubtitle

@author Neil Booth

@page

@vskip 0pt plus 1filll

@c man begin COPYRIGHT

Permission is granted to make and distribute verbatim copies of

this manual provided the copyright notice and this permission notice

are preserved on all copies.

Permission is granted to copy and distribute modified versions of this

manual under the conditions for verbatim copying, provided also that

the entire resulting derived work is distributed under the terms of a

permission notice identical to this one.

Permission is granted to copy and distribute translations of this manual

into another language, under the above conditions for modified versions.

@c man end

@end titlepage

@contents

@page

@ifnottex

@node Top

@top

@chapter Cpplib---the GNU C Preprocessor

The GNU C preprocessor is

implemented as a library, @dfn{cpplib}, so it can be easily shared between

a stand-alone preprocessor, and a preprocessor integrated with the C,

C++ and Objective-C front ends. It is also available for use by other

programs, though this is not recommended as its exposed interface has

not yet reached a point of reasonable stability.

The library has been written to be re-entrant, so that it can be used

to preprocess many files simultaneously if necessary. It has also been

written with the preprocessing token as the fundamental unit; the

preprocessor in previous versions of GCC would operate on text strings

as the fundamental unit.

This brief manual documents the internals of cpplib, and explains some

of the tricky issues. It is intended that, along with the comments in

the source code, a reasonably competent C programmer should be able to

figure out what the code is doing, and why things have been implemented

the way they have.

@menu

* Conventions:: Conventions used in the code.

* Lexer:: The combined C, C++ and Objective-C Lexer.

* Hash Nodes:: All identifiers are entered into a hash table.

* Macro Expansion:: Macro expansion algorithm.

* Token Spacing:: Spacing and paste avoidance issues.

* Line Numbering:: Tracking location within files.

* Guard Macros:: Optimizing header files with guard macros.

100

* Files:: File handling.

101

* Concept Index:: Index.

102

@end menu

103

@end ifnottex

104

105

@node Conventions

106

@unnumbered Conventions

107

@cindex interface

108

@cindex header files

109

110

cpplib has two interfaces---one is exposed internally only, and the

111

other is for both internal and external use.

112

113

The convention is that functions and types that are exposed to multiple

114

files internally are prefixed with @samp{_cpp_}, and are to be found in

115

the file @file{internal.h}. Functions and types exposed to external

116

clients are in @file{cpplib.h}, and prefixed with @samp{cpp_}. For

117

historical reasons this is no longer quite true, but we should strive to

118

stick to it.

119

120

We are striving to reduce the information exposed in @file{cpplib.h} to the

121

bare minimum necessary, and then to keep it there. This makes clear

122

exactly what external clients are entitled to assume, and allows us to

123

change internals in the future without worrying whether library clients

124

are perhaps relying on some kind of undocumented implementation-specific

125

behavior.

126

127

@node Lexer

128

@unnumbered The Lexer

129

@cindex lexer

130

@cindex newlines

131

@cindex escaped newlines

132

133

@section Overview

134

The lexer is contained in the file @file{lex.c}. It is a hand-coded

135

lexer, and not implemented as a state machine. It can understand C, C++

136

and Objective-C source code, and has been extended to allow reasonably

137

successful preprocessing of assembly language. The lexer does not make

138

an initial pass to strip out trigraphs and escaped newlines, but handles

139

them as they are encountered in a single pass of the input file. It

140

returns preprocessing tokens individually, not a line at a time.

141

142

It is mostly transparent to users of the library, since the library's

143

interface for obtaining the next token, @code{cpp_get_token}, takes care

144

of lexing new tokens, handling directives, and expanding macros as

145

necessary. However, the lexer does expose some functionality so that

146

clients of the library can easily spell a given token, such as

147

@code{cpp_spell_token} and @code{cpp_token_len}. These functions are

148

useful when generating diagnostics, and for emitting the preprocessed

149

output.

150

151

@section Lexing a token

152

Lexing of an individual token is handled by @code{_cpp_lex_direct} and

153

its subroutines. In its current form the code is quite complicated,

154

with read ahead characters and such-like, since it strives to not step

155

back in the character stream in preparation for handling non-ASCII file

156

encodings. The current plan is to convert any such files to UTF-8

157

before processing them. This complexity is therefore unnecessary and

158

will be removed, so I'll not discuss it further here.

159

160

The job of @code{_cpp_lex_direct} is simply to lex a token. It is not

161

responsible for issues like directive handling, returning lookahead

162

tokens directly, multiple-include optimization, or conditional block

163

skipping. It necessarily has a minor r@^ole to play in memory

164

management of lexed lines. I discuss these issues in a separate section

165

(@pxref{Lexing a line}).

166

167

The lexer places the token it lexes into storage pointed to by the

168

variable @code{cur_token}, and then increments it. This variable is

169

important for correct diagnostic positioning. Unless a specific line

170

and column are passed to the diagnostic routines, they will examine the

171

@code{line} and @code{col} values of the token just before the location

172

that @code{cur_token} points to, and use that location to report the

173

diagnostic.

174

175

The lexer does not consider whitespace to be a token in its own right.

176

If whitespace (other than a new line) precedes a token, it sets the

177

@code{PREV_WHITE} bit in the token's flags. Each token has its

178

@code{line} and @code{col} variables set to the line and column of the

179

first character of the token. This line number is the line number in

180

the translation unit, and can be converted to a source (file, line) pair

181

using the line map code.

182

183

The first token on a logical, i.e.@: unescaped, line has the flag

184

@code{BOL} set for beginning-of-line. This flag is intended for

185

internal use, both to distinguish a @samp{#} that begins a directive

186

from one that doesn't, and to generate a call-back to clients that want

187

to be notified about the start of every non-directive line with tokens

188

on it. Clients cannot reliably determine this for themselves: the first

189

token might be a macro, and the tokens of a macro expansion do not have

190

the @code{BOL} flag set. The macro expansion may even be empty, and the

191

next token on the line certainly won't have the @code{BOL} flag set.

192

193

New lines are treated specially; exactly how the lexer handles them is

194

context-dependent. The C standard mandates that directives are

195

terminated by the first unescaped newline character, even if it appears

196

in the middle of a macro expansion. Therefore, if the state variable

197

@code{in_directive} is set, the lexer returns a @code{CPP_EOF} token,

198

which is normally used to indicate end-of-file, to indicate

199

end-of-directive. In a directive a @code{CPP_EOF} token never means

200

end-of-file. Conveniently, if the caller was @code{collect_args}, it

201

already handles @code{CPP_EOF} as if it were end-of-file, and reports an

202

error about an unterminated macro argument list.

203

204

The C standard also specifies that a new line in the middle of the

205

arguments to a macro is treated as whitespace. This white space is

206

important in case the macro argument is stringified. The state variable

207

@code{parsing_args} is nonzero when the preprocessor is collecting the

208

arguments to a macro call. It is set to 1 when looking for the opening

209

parenthesis to a function-like macro, and 2 when collecting the actual

210

arguments up to the closing parenthesis, since these two cases need to

211

be distinguished sometimes. One such time is here: the lexer sets the

212

@code{PREV_WHITE} flag of a token if it meets a new line when

213

@code{parsing_args} is set to 2. It doesn't set it if it meets a new

214

line when @code{parsing_args} is 1, since then code like

215

216

@smallexample

217

#define foo() bar

218

foo

219

baz

220

@end smallexample

221

222

@noindent would be output with an erroneous space before @samp{baz}:

223

224

@smallexample

225

foo

226

baz

227

@end smallexample

228

229

This is a good example of the subtlety of getting token spacing correct

230

in the preprocessor; there are plenty of tests in the testsuite for

231

corner cases like this.

232

233

The lexer is written to treat each of @samp{\r}, @samp{\n}, @samp{\r\n}

234

and @samp{\n\r} as a single new line indicator. This allows it to

235

transparently preprocess MS-DOS, Macintosh and Unix files without their

236

needing to pass through a special filter beforehand.

237

238

We also decided to treat a backslash, either @samp{\} or the trigraph

239

@samp{??/}, separated from one of the above newline indicators by

240

non-comment whitespace only, as intending to escape the newline. It

241

tends to be a typing mistake, and cannot reasonably be mistaken for

242

anything else in any of the C-family grammars. Since handling it this

243

way is not strictly conforming to the ISO standard, the library issues a

244

warning wherever it encounters it.

245

246

Handling newlines like this is made simpler by doing it in one place

247

only. The function @code{handle_newline} takes care of all newline

248

characters, and @code{skip_escaped_newlines} takes care of arbitrarily

249

long sequences of escaped newlines, deferring to @code{handle_newline}

250

to handle the newlines themselves.

251

252

The most painful aspect of lexing ISO-standard C and C++ is handling

253

trigraphs and backlash-escaped newlines. Trigraphs are processed before

254

any interpretation of the meaning of a character is made, and unfortunately

255

there is a trigraph representation for a backslash, so it is possible for

256

the trigraph @samp{??/} to introduce an escaped newline.

257

258

Escaped newlines are tedious because theoretically they can occur

259

anywhere---between the @samp{+} and @samp{=} of the @samp{+=} token,

260

within the characters of an identifier, and even between the @samp{*}

261

and @samp{/} that terminates a comment. Moreover, you cannot be sure

262

there is just one---there might be an arbitrarily long sequence of them.

263

264

So, for example, the routine that lexes a number, @code{parse_number},

265

cannot assume that it can scan forwards until the first non-number

266

character and be done with it, because this could be the @samp{\}

267

introducing an escaped newline, or the @samp{?} introducing the trigraph

268

sequence that represents the @samp{\} of an escaped newline. If it

269

encounters a @samp{?} or @samp{\}, it calls @code{skip_escaped_newlines}

270

to skip over any potential escaped newlines before checking whether the

271

number has been finished.

272

273

Similarly code in the main body of @code{_cpp_lex_direct} cannot simply

274

check for a @samp{=} after a @samp{+} character to determine whether it

275

has a @samp{+=} token; it needs to be prepared for an escaped newline of

276

some sort. Such cases use the function @code{get_effective_char}, which

277

returns the first character after any intervening escaped newlines.

278

279

The lexer needs to keep track of the correct column position, including

280

counting tabs as specified by the @option{-ftabstop=} option. This

281

should be done even within C-style comments; they can appear in the

282

middle of a line, and we want to report diagnostics in the correct

283

position for text appearing after the end of the comment.

284

285

@anchor{Invalid identifiers}

286

Some identifiers, such as @code{__VA_ARGS__} and poisoned identifiers,

287

may be invalid and require a diagnostic. However, if they appear in a

288

macro expansion we don't want to complain with each use of the macro.

289

It is therefore best to catch them during the lexing stage, in

290

@code{parse_identifier}. In both cases, whether a diagnostic is needed

291

or not is dependent upon the lexer's state. For example, we don't want

292

to issue a diagnostic for re-poisoning a poisoned identifier, or for

293

using @code{__VA_ARGS__} in the expansion of a variable-argument macro.

294

Therefore @code{parse_identifier} makes use of state flags to determine

295

whether a diagnostic is appropriate. Since we change state on a

296

per-token basis, and don't lex whole lines at a time, this is not a

297

problem.

298

299

Another place where state flags are used to change behavior is whilst

300

lexing header names. Normally, a @samp{<} would be lexed as a single

301

token. After a @code{#include} directive, though, it should be lexed as

302

a single token as far as the nearest @samp{>} character. Note that we

303

don't allow the terminators of header names to be escaped; the first

304

@samp{"} or @samp{>} terminates the header name.

305

306

Interpretation of some character sequences depends upon whether we are

307

lexing C, C++ or Objective-C, and on the revision of the standard in

308

force. For example, @samp{::} is a single token in C++, but in C it is

309

two separate @samp{:} tokens and almost certainly a syntax error. Such

310

cases are handled by @code{_cpp_lex_direct} based upon command-line

311

flags stored in the @code{cpp_options} structure.

312

313

Once a token has been lexed, it leads an independent existence. The

314

spelling of numbers, identifiers and strings is copied to permanent

315

storage from the original input buffer, so a token remains valid and

316

correct even if its source buffer is freed with @code{_cpp_pop_buffer}.

317

The storage holding the spellings of such tokens remains until the

318

client program calls cpp_destroy, probably at the end of the translation

319

unit.

320

321

@anchor{Lexing a line}

322

@section Lexing a line

323

@cindex token run

324

325

When the preprocessor was changed to return pointers to tokens, one

326

feature I wanted was some sort of guarantee regarding how long a

327

returned pointer remains valid. This is important to the stand-alone

328

preprocessor, the future direction of the C family front ends, and even

329

to cpplib itself internally.

330

331

Occasionally the preprocessor wants to be able to peek ahead in the

332

token stream. For example, after the name of a function-like macro, it

333

wants to check the next token to see if it is an opening parenthesis.

334

Another example is that, after reading the first few tokens of a

335

@code{#pragma} directive and not recognizing it as a registered pragma,

336

it wants to backtrack and allow the user-defined handler for unknown

337

pragmas to access the full @code{#pragma} token stream. The stand-alone

338

preprocessor wants to be able to test the current token with the

339

previous one to see if a space needs to be inserted to preserve their

340

separate tokenization upon re-lexing (paste avoidance), so it needs to

341

be sure the pointer to the previous token is still valid. The

342

recursive-descent C++ parser wants to be able to perform tentative

343

parsing arbitrarily far ahead in the token stream, and then to be able

344

to jump back to a prior position in that stream if necessary.

345

346

The rule I chose, which is fairly natural, is to arrange that the

347

preprocessor lex all tokens on a line consecutively into a token buffer,

348

which I call a @dfn{token run}, and when meeting an unescaped new line

349

(newlines within comments do not count either), to start lexing back at

350

the beginning of the run. Note that we do @emph{not} lex a line of

351

tokens at once; if we did that @code{parse_identifier} would not have

352

state flags available to warn about invalid identifiers (@pxref{Invalid

353

identifiers}).

354

355

In other words, accessing tokens that appeared earlier in the current

356

line is valid, but since each logical line overwrites the tokens of the

357

previous line, tokens from prior lines are unavailable. In particular,

358

since a directive only occupies a single logical line, this means that

359

the directive handlers like the @code{#pragma} handler can jump around

360

in the directive's tokens if necessary.

361

362

Two issues remain: what about tokens that arise from macro expansions,

363

and what happens when we have a long line that overflows the token run?

364

365

Since we promise clients that we preserve the validity of pointers that

366

we have already returned for tokens that appeared earlier in the line,

367

we cannot reallocate the run. Instead, on overflow it is expanded by

368

chaining a new token run on to the end of the existing one.

369

370

The tokens forming a macro's replacement list are collected by the

371

@code{#define} handler, and placed in storage that is only freed by

372

@code{cpp_destroy}. So if a macro is expanded in the line of tokens,

373

the pointers to the tokens of its expansion that are returned will always

374

remain valid. However, macros are a little trickier than that, since

375

they give rise to three sources of fresh tokens. They are the built-in

376

macros like @code{__LINE__}, and the @samp{#} and @samp{##} operators

377

for stringification and token pasting. I handled this by allocating

378

space for these tokens from the lexer's token run chain. This means

379

they automatically receive the same lifetime guarantees as lexed tokens,

380

and we don't need to concern ourselves with freeing them.

381

382

Lexing into a line of tokens solves some of the token memory management

383

issues, but not all. The opening parenthesis after a function-like

384

macro name might lie on a different line, and the front ends definitely

385

want the ability to look ahead past the end of the current line. So

386

cpplib only moves back to the start of the token run at the end of a

387

line if the variable @code{keep_tokens} is zero. Line-buffering is

388

quite natural for the preprocessor, and as a result the only time cpplib

389

needs to increment this variable is whilst looking for the opening

390

parenthesis to, and reading the arguments of, a function-like macro. In

391

the near future cpplib will export an interface to increment and

392

decrement this variable, so that clients can share full control over the

393

lifetime of token pointers too.

394

395

The routine @code{_cpp_lex_token} handles moving to new token runs,

396

calling @code{_cpp_lex_direct} to lex new tokens, or returning

397

previously-lexed tokens if we stepped back in the token stream. It also

398

checks each token for the @code{BOL} flag, which might indicate a

399

directive that needs to be handled, or require a start-of-line call-back

400

to be made. @code{_cpp_lex_token} also handles skipping over tokens in

401

failed conditional blocks, and invalidates the control macro of the

402

multiple-include optimization if a token was successfully lexed outside

403

a directive. In other words, its callers do not need to concern

404

themselves with such issues.

405

406

@node Hash Nodes

407

@unnumbered Hash Nodes

408

@cindex hash table

409

@cindex identifiers

410

@cindex macros

411

@cindex assertions

412

@cindex named operators

413

414

When cpplib encounters an ``identifier'', it generates a hash code for

415

it and stores it in the hash table. By ``identifier'' we mean tokens

416

with type @code{CPP_NAME}; this includes identifiers in the usual C

417

sense, as well as keywords, directive names, macro names and so on. For

418

example, all of @code{pragma}, @code{int}, @code{foo} and

419

@code{__GNUC__} are identifiers and hashed when lexed.

420

421

Each node in the hash table contain various information about the

422

identifier it represents. For example, its length and type. At any one

423

time, each identifier falls into exactly one of three categories:

424

425

@itemize @bullet

426

@item Macros

427

428

These have been declared to be macros, either on the command line or

429

with @code{#define}. A few, such as @code{__TIME__} are built-ins

430

entered in the hash table during initialization. The hash node for a

431

normal macro points to a structure with more information about the

432

macro, such as whether it is function-like, how many arguments it takes,

433

and its expansion. Built-in macros are flagged as special, and instead

434

contain an enum indicating which of the various built-in macros it is.

435

436

@item Assertions

437

438

Assertions are in a separate namespace to macros. To enforce this, cpp

439

actually prepends a @code{#} character before hashing and entering it in

440

the hash table. An assertion's node points to a chain of answers to

441

that assertion.

442

443

@item Void

444

445

Everything else falls into this category---an identifier that is not

446

currently a macro, or a macro that has since been undefined with

447

@code{#undef}.

448

449

When preprocessing C++, this category also includes the named operators,

450

such as @code{xor}. In expressions these behave like the operators they

451

represent, but in contexts where the spelling of a token matters they

452

are spelt differently. This spelling distinction is relevant when they

453

are operands of the stringizing and pasting macro operators @code{#} and

454

@code{##}. Named operator hash nodes are flagged, both to catch the

455

spelling distinction and to prevent them from being defined as macros.

456

@end itemize

457

458

The same identifiers share the same hash node. Since each identifier

459

token, after lexing, contains a pointer to its hash node, this is used

460

to provide rapid lookup of various information. For example, when

461

parsing a @code{#define} statement, CPP flags each argument's identifier

462

hash node with the index of that argument. This makes duplicated

463

argument checking an O(1) operation for each argument. Similarly, for

464

each identifier in the macro's expansion, lookup to see if it is an

465

argument, and which argument it is, is also an O(1) operation. Further,

466

each directive name, such as @code{endif}, has an associated directive

467

enum stored in its hash node, so that directive lookup is also O(1).

468

469

@node Macro Expansion

470

@unnumbered Macro Expansion Algorithm

471

@cindex macro expansion

472

473

Macro expansion is a tricky operation, fraught with nasty corner cases

474

and situations that render what you thought was a nifty way to

475

optimize the preprocessor's expansion algorithm wrong in quite subtle

476

ways.

477

478

I strongly recommend you have a good grasp of how the C and C++

479

standards require macros to be expanded before diving into this

480

section, let alone the code!. If you don't have a clear mental

481

picture of how things like nested macro expansion, stringification and

482

token pasting are supposed to work, damage to your sanity can quickly

483

result.

484

485

@section Internal representation of macros

486

@cindex macro representation (internal)

487

488

The preprocessor stores macro expansions in tokenized form. This

489

saves repeated lexing passes during expansion, at the cost of a small

490

increase in memory consumption on average. The tokens are stored

491

contiguously in memory, so a pointer to the first one and a token

492

count is all you need to get the replacement list of a macro.

493

494

If the macro is a function-like macro the preprocessor also stores its

495

parameters, in the form of an ordered list of pointers to the hash

496

table entry of each parameter's identifier. Further, in the macro's

497

stored expansion each occurrence of a parameter is replaced with a

498

special token of type @code{CPP_MACRO_ARG}. Each such token holds the

499

index of the parameter it represents in the parameter list, which

500

allows rapid replacement of parameters with their arguments during

501

expansion. Despite this optimization it is still necessary to store

502

the original parameters to the macro, both for dumping with e.g.,

503

@option{-dD}, and to warn about non-trivial macro redefinitions when

504

the parameter names have changed.

505

506

@section Macro expansion overview

507

The preprocessor maintains a @dfn{context stack}, implemented as a

508

linked list of @code{cpp_context} structures, which together represent

509

the macro expansion state at any one time. The @code{struct

510

cpp_reader} member variable @code{context} points to the current top

511

of this stack. The top normally holds the unexpanded replacement list

512

of the innermost macro under expansion, except when cpplib is about to

513

pre-expand an argument, in which case it holds that argument's

514

unexpanded tokens.

515

516

When there are no macros under expansion, cpplib is in @dfn{base

517

context}. All contexts other than the base context contain a

518

contiguous list of tokens delimited by a starting and ending token.

519

When not in base context, cpplib obtains the next token from the list

520

of the top context. If there are no tokens left in the list, it pops

521

that context off the stack, and subsequent ones if necessary, until an

522

unexhausted context is found or it returns to base context. In base

523

context, cpplib reads tokens directly from the lexer.

524

525

If it encounters an identifier that is both a macro and enabled for

526

expansion, cpplib prepares to push a new context for that macro on the

527

stack by calling the routine @code{enter_macro_context}. When this

528

routine returns, the new context will contain the unexpanded tokens of

529

the replacement list of that macro. In the case of function-like

530

macros, @code{enter_macro_context} also replaces any parameters in the

531

replacement list, stored as @code{CPP_MACRO_ARG} tokens, with the

532

appropriate macro argument. If the standard requires that the

533

parameter be replaced with its expanded argument, the argument will

534

have been fully macro expanded first.

535

536

@code{enter_macro_context} also handles special macros like

537

@code{__LINE__}. Although these macros expand to a single token which

538

cannot contain any further macros, for reasons of token spacing

539

(@pxref{Token Spacing}) and simplicity of implementation, cpplib

540

handles these special macros by pushing a context containing just that

541

one token.

542

543

The final thing that @code{enter_macro_context} does before returning

544

is to mark the macro disabled for expansion (except for special macros

545

like @code{__TIME__}). The macro is re-enabled when its context is

546

later popped from the context stack, as described above. This strict

547

ordering ensures that a macro is disabled whilst its expansion is

548

being scanned, but that it is @emph{not} disabled whilst any arguments

549

to it are being expanded.

550

551

@section Scanning the replacement list for macros to expand

552

The C standard states that, after any parameters have been replaced

553

with their possibly-expanded arguments, the replacement list is

554

scanned for nested macros. Further, any identifiers in the

555

replacement list that are not expanded during this scan are never

556

again eligible for expansion in the future, if the reason they were

557

not expanded is that the macro in question was disabled.

558

559

Clearly this latter condition can only apply to tokens resulting from

560

argument pre-expansion. Other tokens never have an opportunity to be

561

re-tested for expansion. It is possible for identifiers that are

562

function-like macros to not expand initially but to expand during a

563

later scan. This occurs when the identifier is the last token of an

564

argument (and therefore originally followed by a comma or a closing

565

parenthesis in its macro's argument list), and when it replaces its

566

parameter in the macro's replacement list, the subsequent token

567

happens to be an opening parenthesis (itself possibly the first token

568

of an argument).

569

570

It is important to note that when cpplib reads the last token of a

571

given context, that context still remains on the stack. Only when

572

looking for the @emph{next} token do we pop it off the stack and drop

573

to a lower context. This makes backing up by one token easy, but more

574

importantly ensures that the macro corresponding to the current

575

context is still disabled when we are considering the last token of

576

its replacement list for expansion (or indeed expanding it). As an

577

example, which illustrates many of the points above, consider

578

579

@smallexample

580

#define foo(x) bar x

581

foo(foo) (2)

582

@end smallexample

583

584

@noindent which fully expands to @samp{bar foo (2)}. During pre-expansion

585

of the argument, @samp{foo} does not expand even though the macro is

586

enabled, since it has no following parenthesis [pre-expansion of an

587

argument only uses tokens from that argument; it cannot take tokens

588

from whatever follows the macro invocation]. This still leaves the

589

argument token @samp{foo} eligible for future expansion. Then, when

590

re-scanning after argument replacement, the token @samp{foo} is

591

rejected for expansion, and marked ineligible for future expansion,

592

since the macro is now disabled. It is disabled because the

593

replacement list @samp{bar foo} of the macro is still on the context

594

stack.

595

596

If instead the algorithm looked for an opening parenthesis first and

597

then tested whether the macro were disabled it would be subtly wrong.

598

In the example above, the replacement list of @samp{foo} would be

599

popped in the process of finding the parenthesis, re-enabling

600

@samp{foo} and expanding it a second time.

601

602

@section Looking for a function-like macro's opening parenthesis

603

Function-like macros only expand when immediately followed by a

604

parenthesis. To do this cpplib needs to temporarily disable macros

605

and read the next token. Unfortunately, because of spacing issues

606

(@pxref{Token Spacing}), there can be fake padding tokens in-between,

607

and if the next real token is not a parenthesis cpplib needs to be

608

able to back up that one token as well as retain the information in

609

any intervening padding tokens.

610

611

Backing up more than one token when macros are involved is not

612

permitted by cpplib, because in general it might involve issues like

613

restoring popped contexts onto the context stack, which are too hard.

614

Instead, searching for the parenthesis is handled by a special

615

function, @code{funlike_invocation_p}, which remembers padding

616

information as it reads tokens. If the next real token is not an

617

opening parenthesis, it backs up that one token, and then pushes an

618

extra context just containing the padding information if necessary.

619

620

@section Marking tokens ineligible for future expansion

621

As discussed above, cpplib needs a way of marking tokens as

622

unexpandable. Since the tokens cpplib handles are read-only once they

623

have been lexed, it instead makes a copy of the token and adds the

624

flag @code{NO_EXPAND} to the copy.

625

626

For efficiency and to simplify memory management by avoiding having to

627

remember to free these tokens, they are allocated as temporary tokens

628

from the lexer's current token run (@pxref{Lexing a line}) using the

629

function @code{_cpp_temp_token}. The tokens are then re-used once the

630

current line of tokens has been read in.

631

632

This might sound unsafe. However, tokens runs are not re-used at the

633

end of a line if it happens to be in the middle of a macro argument

634

list, and cpplib only wants to back-up more than one lexer token in

635

situations where no macro expansion is involved, so the optimization

636

is safe.

637

638

@node Token Spacing

639

@unnumbered Token Spacing

640

@cindex paste avoidance

641

@cindex spacing

642

@cindex token spacing

643

644

First, consider an issue that only concerns the stand-alone

645

preprocessor: there needs to be a guarantee that re-reading its preprocessed

646

output results in an identical token stream. Without taking special

647

measures, this might not be the case because of macro substitution.

648

For example:

649

650

@smallexample

651

#define PLUS +

652

#define EMPTY

653

#define f(x) =x=

654

+PLUS -EMPTY- PLUS+ f(=)

655

@expansion{} + + - - + + = = =

656

@emph{not}

657

@expansion{} ++ -- ++ ===

658

@end smallexample

659

660

One solution would be to simply insert a space between all adjacent

661

tokens. However, we would like to keep space insertion to a minimum,

662

both for aesthetic reasons and because it causes problems for people who

663

still try to abuse the preprocessor for things like Fortran source and

664

Makefiles.

665

666

For now, just notice that when tokens are added (or removed, as shown by

667

the @code{EMPTY} example) from the original lexed token stream, we need

668

to check for accidental token pasting. We call this @dfn{paste

669

avoidance}. Token addition and removal can only occur because of macro

670

expansion, but accidental pasting can occur in many places: both before

671

and after each macro replacement, each argument replacement, and

672

additionally each token created by the @samp{#} and @samp{##} operators.

673

674

Look at how the preprocessor gets whitespace output correct

675

normally. The @code{cpp_token} structure contains a flags byte, and one

676

of those flags is @code{PREV_WHITE}. This is flagged by the lexer, and

677

indicates that the token was preceded by whitespace of some form other

678

than a new line. The stand-alone preprocessor can use this flag to

679

decide whether to insert a space between tokens in the output.

680

681

Now consider the result of the following macro expansion:

682

683

@smallexample

684

#define add(x, y, z) x + y +z;

685

sum = add (1,2, 3);

686

@expansion{} sum = 1 + 2 +3;

687

@end smallexample

688

689

The interesting thing here is that the tokens @samp{1} and @samp{2} are

690

output with a preceding space, and @samp{3} is output without a

691

preceding space, but when lexed none of these tokens had that property.

692

Careful consideration reveals that @samp{1} gets its preceding

693

whitespace from the space preceding @samp{add} in the macro invocation,

694

@emph{not} replacement list. @samp{2} gets its whitespace from the

695

space preceding the parameter @samp{y} in the macro replacement list,

696

and @samp{3} has no preceding space because parameter @samp{z} has none

697

in the replacement list.

698

699

Once lexed, tokens are effectively fixed and cannot be altered, since

700

pointers to them might be held in many places, in particular by

701

in-progress macro expansions. So instead of modifying the two tokens

702

above, the preprocessor inserts a special token, which I call a

703

@dfn{padding token}, into the token stream to indicate that spacing of

704

the subsequent token is special. The preprocessor inserts padding

705

tokens in front of every macro expansion and expanded macro argument.

706

These point to a @dfn{source token} from which the subsequent real token

707

should inherit its spacing. In the above example, the source tokens are

708

@samp{add} in the macro invocation, and @samp{y} and @samp{z} in the

709

macro replacement list, respectively.

710

711

It is quite easy to get multiple padding tokens in a row, for example if

712

a macro's first replacement token expands straight into another macro.

713

714

@smallexample

715

#define foo bar

716

#define bar baz

717

[foo]

718

@expansion{} [baz]

719

@end smallexample

720

721

Here, two padding tokens are generated with sources the @samp{foo} token

722

between the brackets, and the @samp{bar} token from foo's replacement

723

list, respectively. Clearly the first padding token is the one to

724

use, so the output code should contain a rule that the first

725

padding token in a sequence is the one that matters.

726

727

But what if a macro expansion is left? Adjusting the above

728

example slightly:

729

730

@smallexample

731

#define foo bar

732

#define bar EMPTY baz

733

#define EMPTY

734

[foo] EMPTY;

735

@expansion{} [ baz] ;

736

@end smallexample

737

738

As shown, now there should be a space before @samp{baz} and the

739

semicolon in the output.

740

741

The rules we decided above fail for @samp{baz}: we generate three

742

padding tokens, one per macro invocation, before the token @samp{baz}.

743

We would then have it take its spacing from the first of these, which

744

carries source token @samp{foo} with no leading space.

745

746

It is vital that cpplib get spacing correct in these examples since any

747

of these macro expansions could be stringified, where spacing matters.

748

749

So, this demonstrates that not just entering macro and argument

750

expansions, but leaving them requires special handling too. I made

751

cpplib insert a padding token with a @code{NULL} source token when

752

leaving macro expansions, as well as after each replaced argument in a

753

macro's replacement list. It also inserts appropriate padding tokens on

754

either side of tokens created by the @samp{#} and @samp{##} operators.

755

I expanded the rule so that, if we see a padding token with a

756

@code{NULL} source token, @emph{and} that source token has no leading

757

space, then we behave as if we have seen no padding tokens at all. A

758

quick check shows this rule will then get the above example correct as

759

well.

760

761

Now a relationship with paste avoidance is apparent: we have to be

762

careful about paste avoidance in exactly the same locations we have

763

padding tokens in order to get white space correct. This makes

764

implementation of paste avoidance easy: wherever the stand-alone

765

preprocessor is fixing up spacing because of padding tokens, and it

766

turns out that no space is needed, it has to take the extra step to

767

check that a space is not needed after all to avoid an accidental paste.

768

The function @code{cpp_avoid_paste} advises whether a space is required

769

between two consecutive tokens. To avoid excessive spacing, it tries

770

hard to only require a space if one is likely to be necessary, but for

771

reasons of efficiency it is slightly conservative and might recommend a

772

space where one is not strictly needed.

773

774

@node Line Numbering

775

@unnumbered Line numbering

776

@cindex line numbers

777

778

@section Just which line number anyway?

779

780

There are three reasonable requirements a cpplib client might have for

781

the line number of a token passed to it:

782

783

@itemize @bullet

784

@item

785

The source line it was lexed on.

786

@item

787

The line it is output on. This can be different to the line it was

788

lexed on if, for example, there are intervening escaped newlines or

789

C-style comments. For example:

790

791

@smallexample

792

foo /* @r{A long

793

comment} */ bar \

794

baz

795

@result{}

796

foo bar baz

797

@end smallexample

798

799

@item

800

If the token results from a macro expansion, the line of the macro name,

801

or possibly the line of the closing parenthesis in the case of

802

function-like macro expansion.

803

@end itemize

804

805

The @code{cpp_token} structure contains @code{line} and @code{col}

806

members. The lexer fills these in with the line and column of the first

807

character of the token. Consequently, but maybe unexpectedly, a token

808

from the replacement list of a macro expansion carries the location of

809

the token within the @code{#define} directive, because cpplib expands a

810

macro by returning pointers to the tokens in its replacement list. The

811

current implementation of cpplib assigns tokens created from built-in

812

macros and the @samp{#} and @samp{##} operators the location of the most

813

recently lexed token. This is a because they are allocated from the

814

lexer's token runs, and because of the way the diagnostic routines infer

815

the appropriate location to report.

816

817

The diagnostic routines in cpplib display the location of the most

818

recently @emph{lexed} token, unless they are passed a specific line and

819

column to report. For diagnostics regarding tokens that arise from

820

macro expansions, it might also be helpful for the user to see the

821

original location in the macro definition that the token came from.

822

Since that is exactly the information each token carries, such an

823

enhancement could be made relatively easily in future.

824

825

The stand-alone preprocessor faces a similar problem when determining

826

the correct line to output the token on: the position attached to a

827

token is fairly useless if the token came from a macro expansion. All

828

tokens on a logical line should be output on its first physical line, so

829

the token's reported location is also wrong if it is part of a physical

830

line other than the first.

831

832

To solve these issues, cpplib provides a callback that is generated

833

whenever it lexes a preprocessing token that starts a new logical line

834

other than a directive. It passes this token (which may be a

835

@code{CPP_EOF} token indicating the end of the translation unit) to the

836

callback routine, which can then use the line and column of this token

837

to produce correct output.

838

839

@section Representation of line numbers

840

841

As mentioned above, cpplib stores with each token the line number that

842

it was lexed on. In fact, this number is not the number of the line in

843

the source file, but instead bears more resemblance to the number of the

844

line in the translation unit.

845

846

The preprocessor maintains a monotonic increasing line count, which is

847

incremented at every new line character (and also at the end of any

848

buffer that does not end in a new line). Since a line number of zero is

849

useful to indicate certain special states and conditions, this variable

850

starts counting from one.

851

852

This variable therefore uniquely enumerates each line in the translation

853

unit. With some simple infrastructure, it is straight forward to map

854

from this to the original source file and line number pair, saving space

855

whenever line number information needs to be saved. The code the

856

implements this mapping lies in the files @file{line-map.c} and

857

@file{line-map.h}.

858

859

Command-line macros and assertions are implemented by pushing a buffer

860

containing the right hand side of an equivalent @code{#define} or

861

@code{#assert} directive. Some built-in macros are handled similarly.

862

Since these are all processed before the first line of the main input

863

file, it will typically have an assigned line closer to twenty than to

864

one.

865

866

@node Guard Macros

867

@unnumbered The Multiple-Include Optimization

868

@cindex guard macros

869

@cindex controlling macros

870

@cindex multiple-include optimization

871

872

Header files are often of the form

873

874

@smallexample

875

#ifndef FOO

876

#define FOO

877

@dots{}

878

#endif

879

@end smallexample

880

881

@noindent

882

to prevent the compiler from processing them more than once. The

883

preprocessor notices such header files, so that if the header file

884

appears in a subsequent @code{#include} directive and @code{FOO} is

885

defined, then it is ignored and it doesn't preprocess or even re-open

886

the file a second time. This is referred to as the @dfn{multiple

887

include optimization}.

888

889

Under what circumstances is such an optimization valid? If the file

890

were included a second time, it can only be optimized away if that

891

inclusion would result in no tokens to return, and no relevant

892

directives to process. Therefore the current implementation imposes

893

requirements and makes some allowances as follows:

894

895

@enumerate

896

@item

897

There must be no tokens outside the controlling @code{#if}-@code{#endif}

898

pair, but whitespace and comments are permitted.

899

900

@item

901

There must be no directives outside the controlling directive pair, but

902

the @dfn{null directive} (a line containing nothing other than a single

903

@samp{#} and possibly whitespace) is permitted.

904

905

@item

906

The opening directive must be of the form

907

908

@smallexample

909

#ifndef FOO

910

@end smallexample

911

912

913

914

@smallexample

915

#if !defined FOO [equivalently, #if !defined(FOO)]

916

@end smallexample

917

918

@item

919

In the second form above, the tokens forming the @code{#if} expression

920

must have come directly from the source file---no macro expansion must

921

have been involved. This is because macro definitions can change, and

922

tracking whether or not a relevant change has been made is not worth the

923

implementation cost.

924

925

@item

926

There can be no @code{#else} or @code{#elif} directives at the outer

927

conditional block level, because they would probably contain something

928

of interest to a subsequent pass.

929

@end enumerate

930

931

First, when pushing a new file on the buffer stack,

932

@code{_stack_include_file} sets the controlling macro @code{mi_cmacro} to

933

@code{NULL}, and sets @code{mi_valid} to @code{true}. This indicates

934

that the preprocessor has not yet encountered anything that would

935

invalidate the multiple-include optimization. As described in the next

936

few paragraphs, these two variables having these values effectively

937

indicates top-of-file.

938

939

When about to return a token that is not part of a directive,

940

@code{_cpp_lex_token} sets @code{mi_valid} to @code{false}. This

941

enforces the constraint that tokens outside the controlling conditional

942

block invalidate the optimization.

943

944

The @code{do_if}, when appropriate, and @code{do_ifndef} directive

945

handlers pass the controlling macro to the function

946

@code{push_conditional}. cpplib maintains a stack of nested conditional

947

blocks, and after processing every opening conditional this function

948

pushes an @code{if_stack} structure onto the stack. In this structure

949

it records the controlling macro for the block, provided there is one

950

and we're at top-of-file (as described above). If an @code{#elif} or

951

@code{#else} directive is encountered, the controlling macro for that

952

block is cleared to @code{NULL}. Otherwise, it survives until the

953

@code{#endif} closing the block, upon which @code{do_endif} sets

954

@code{mi_valid} to true and stores the controlling macro in

955

@code{mi_cmacro}.

956

957

@code{_cpp_handle_directive} clears @code{mi_valid} when processing any

958

directive other than an opening conditional and the null directive.

959

With this, and requiring top-of-file to record a controlling macro, and

960

no @code{#else} or @code{#elif} for it to survive and be copied to

961

@code{mi_cmacro} by @code{do_endif}, we have enforced the absence of

962

directives outside the main conditional block for the optimization to be

963

on.

964

965

Note that whilst we are inside the conditional block, @code{mi_valid} is

966

likely to be reset to @code{false}, but this does not matter since

967

the closing @code{#endif} restores it to @code{true} if appropriate.

968

969

Finally, since @code{_cpp_lex_direct} pops the file off the buffer stack

970

at @code{EOF} without returning a token, if the @code{#endif} directive

971

was not followed by any tokens, @code{mi_valid} is @code{true} and

972

@code{_cpp_pop_file_buffer} remembers the controlling macro associated

973

with the file. Subsequent calls to @code{stack_include_file} result in

974

no buffer being pushed if the controlling macro is defined, effecting

975

the optimization.

976

977

A quick word on how we handle the

978

979

@smallexample

980

#if !defined FOO

981

@end smallexample

982

983

@noindent

984

case. @code{_cpp_parse_expr} and @code{parse_defined} take steps to see

985

whether the three stages @samp{!}, @samp{defined-expression} and

986

@samp{end-of-directive} occur in order in a @code{#if} expression. If

987

so, they return the guard macro to @code{do_if} in the variable

988

@code{mi_ind_cmacro}, and otherwise set it to @code{NULL}.

989

@code{enter_macro_context} sets @code{mi_valid} to false, so if a macro

990

was expanded whilst parsing any part of the expression, then the

991

top-of-file test in @code{push_conditional} fails and the optimization

992

is turned off.

993

994

@node Files

995

@unnumbered File Handling

996

@cindex files

997

998

Fairly obviously, the file handling code of cpplib resides in the file

999

@file{files.c}. It takes care of the details of file searching,

1000

opening, reading and caching, for both the main source file and all the

1001

headers it recursively includes.

1002

1003

The basic strategy is to minimize the number of system calls. On many

1004

systems, the basic @code{open ()} and @code{fstat ()} system calls can

1005

be quite expensive. For every @code{#include}-d file, we need to try

1006

all the directories in the search path until we find a match. Some

1007

projects, such as glibc, pass twenty or thirty include paths on the

1008

command line, so this can rapidly become time consuming.

1009

1010

For a header file we have not encountered before we have little choice

1011

but to do this. However, it is often the case that the same headers are

1012

repeatedly included, and in these cases we try to avoid repeating the

1013

filesystem queries whilst searching for the correct file.

1014

1015

For each file we try to open, we store the constructed path in a splay

1016

tree. This path first undergoes simplification by the function

1017

@code{_cpp_simplify_pathname}. For example,

1018

@file{/usr/include/bits/../foo.h} is simplified to

1019

@file{/usr/include/foo.h} before we enter it in the splay tree and try

1020

to @code{open ()} the file. CPP will then find subsequent uses of

1021

@file{foo.h}, even as @file{/usr/include/foo.h}, in the splay tree and

1022

save system calls.

1023

1024

Further, it is likely the file contents have also been cached, saving a

1025

@code{read ()} system call. We don't bother caching the contents of

1026

header files that are re-inclusion protected, and whose re-inclusion

1027

macro is defined when we leave the header file for the first time. If

1028

the host supports it, we try to map suitably large files into memory,

1029

rather than reading them in directly.

1030

1031

The include paths are internally stored on a null-terminated

1032

singly-linked list, starting with the @code{"header.h"} directory search

1033

chain, which then links into the @code{<header.h>} directory chain.

1034

1035

Files included with the @code{<foo.h>} syntax start the lookup directly

1036

in the second half of this chain. However, files included with the

1037

@code{"foo.h"} syntax start at the beginning of the chain, but with one

1038

extra directory prepended. This is the directory of the current file;

1039

the one containing the @code{#include} directive. Prepending this

1040

directory on a per-file basis is handled by the function

1041

@code{search_from}.

1042

1043

Note that a header included with a directory component, such as

1044

@code{#include "mydir/foo.h"} and opened as

1045

@file{/usr/local/include/mydir/foo.h}, will have the complete path minus

1046

the basename @samp{foo.h} as the current directory.

1047

1048

Enough information is stored in the splay tree that CPP can immediately

1049

tell whether it can skip the header file because of the multiple include

1050

optimization, whether the file didn't exist or couldn't be opened for

1051

some reason, or whether the header was flagged not to be re-used, as it

1052

is with the obsolete @code{#import} directive.

1053

1054

For the benefit of MS-DOS filesystems with an 8.3 filename limitation,

1055

CPP offers the ability to treat various include file names as aliases

1056

for the real header files with shorter names. The map from one to the

1057

other is found in a special file called @samp{header.gcc}, stored in the

1058

command line (or system) include directories to which the mapping

1059

applies. This may be higher up the directory tree than the full path to

1060

the file minus the base name.

1061

1062

@node Concept Index

1063

@unnumbered Concept Index

1064

@printindex cp

1065

1066

@bye

Older »