<!-- $PostgreSQL: pgsql/doc/src/sgml/arch-dev.sgml,v 2.25 2005-01-05 23:42:02 tgl Exp $ -->

<chapter id="overview">

<title>Overview of PostgreSQL Internals</title>

This chapter originated as part of
<xref linkend="SIM98">, Stefan Simkovics'
Master's Thesis prepared at Vienna University of Technology under the direction
of O.Univ.Prof.Dr. Georg Gottlob and Univ.Ass. Mag. Katrin Seyr.

This chapter gives an overview of the internal structure of the
backend of <productname>PostgreSQL</productname>. After having
read the following sections you should have an idea of how a query
is processed. This chapter does not aim to provide a detailed
description of the internal operation of
<productname>PostgreSQL</productname>, as such a document would be
very extensive. Rather, this chapter is intended to help the reader
understand the general sequence of operations that occur within the
backend from the point at which a query is received, to the point
at which the results are returned to the client.
<sect1 id="query-path">

<title>The Path of a Query</title>

Here we give a short overview of the stages a query has to pass in
order to obtain a result.

A connection from an application program to the <productname>PostgreSQL</productname>
server has to be established. The application program transmits a
query to the server and waits to receive the results sent back by the
server.

The <firstterm>parser stage</firstterm> checks the query
transmitted by the application program
for correct syntax and creates
a <firstterm>query tree</firstterm>.

The <firstterm>rewrite system</firstterm> takes
the query tree created by the parser stage and looks for
any <firstterm>rules</firstterm> (stored in the
<firstterm>system catalogs</firstterm>) to apply to
the query tree. It performs the
transformations given in the <firstterm>rule bodies</firstterm>.

One application of the rewrite system is in the realization of
<firstterm>views</firstterm>.
Whenever a query against a view
(i.e. a <firstterm>virtual table</firstterm>) is made,
the rewrite system rewrites the user's query to
a query that accesses the <firstterm>base tables</firstterm> given in
the <firstterm>view definition</firstterm> instead.
The <firstterm>planner/optimizer</firstterm> takes
the (rewritten) query tree and creates a
<firstterm>query plan</firstterm> that will be the input to the
<firstterm>executor</firstterm>.

It does so by first creating all possible <firstterm>paths</firstterm>
leading to the same result. For example, if there is an index on a
relation to be scanned, there are two paths for the
scan. One possibility is a simple sequential scan and the other
possibility is to use the index. Next the cost for the execution of
each path is estimated and the cheapest path is chosen. The cheapest
path is expanded into a complete plan that the executor can use.

The executor recursively steps through
the <firstterm>plan tree</firstterm> and
retrieves rows in the way represented by the plan.
The executor makes use of the
<firstterm>storage system</firstterm> while scanning
relations, performs <firstterm>sorts</firstterm> and <firstterm>joins</firstterm>,
evaluates <firstterm>qualifications</firstterm>, and finally hands back the rows derived.

In the following sections we will cover each of the above listed items
in more detail to give a better understanding of <productname>PostgreSQL</productname>'s internal
control and data structures.
<sect1 id="connect-estab">

<title>How Connections are Established</title>

<productname>PostgreSQL</productname> is implemented using a
simple <quote>process per user</> client/server model. In this model
there is one <firstterm>client process</firstterm> connected to
exactly one <firstterm>server process</firstterm>. As we do not
know ahead of time how many connections will be made, we have to
use a <firstterm>master process</firstterm> that spawns a new
server process every time a connection is requested. This master
process is called <literal>postmaster</literal> and listens at a
specified TCP/IP port for incoming connections. Whenever a request
for a connection is detected the <literal>postmaster</literal>
process spawns a new server process called
<literal>postgres</literal>. The server tasks
(<literal>postgres</literal> processes) communicate with each
other using <firstterm>semaphores</firstterm> and
<firstterm>shared memory</firstterm> to ensure data integrity
throughout concurrent data access.
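The process-per-connection model can be sketched in a few lines of Python. This is an illustrative model only, not PostgreSQL's actual implementation (the real postmaster is written in C and additionally handles authentication, shared-memory setup, and signal management); the class and function names below are invented for the sketch.

```python
# Minimal sketch of a forking master process in the spirit of the postmaster:
# the parent only listens; each accepted connection is served by a forked
# child, so every client talks to exactly one server process.
import socketserver

class BackendHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # A forked child runs this per-connection loop body.
        query = self.rfile.readline().decode().strip()
        self.wfile.write(f"received: {query}\n".encode())

def serve(port):
    # ForkingTCPServer forks a new process for each incoming connection,
    # analogous to the postmaster spawning a postgres process.
    with socketserver.ForkingTCPServer(("localhost", port), BackendHandler) as srv:
        srv.serve_forever()
```

A client can then connect to the chosen port, send a line of text, and read back the echoed acknowledgement.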
The client process can be any program that understands the
<productname>PostgreSQL</productname> protocol described in
<xref linkend="protocol">. Many clients are based on the
C-language library <application>libpq</>, but several independent
implementations of the protocol exist, such as the Java
<application>JDBC</> driver.

Once a connection is established the client process can send a query
to the <firstterm>backend</firstterm> (server). The query is transmitted using plain text,
i.e. there is no parsing done in the <firstterm>frontend</firstterm> (client). The
server parses the query, creates an <firstterm>execution plan</firstterm>,
executes the plan and returns the retrieved rows to the client
by transmitting them over the established connection.
<sect1 id="parser-stage">

<title>The Parser Stage</title>

The <firstterm>parser stage</firstterm> consists of two parts:

The <firstterm>parser</firstterm> defined in
<filename>gram.y</filename> and <filename>scan.l</filename> is
built using the Unix tools <application>yacc</application>
and <application>lex</application>.

The <firstterm>transformation process</firstterm> modifies
and augments the data structures returned by the parser.
<title>Parser</title>

The parser has to check the query string (which arrives as plain
ASCII text) for valid syntax. If the syntax is correct a
<firstterm>parse tree</firstterm> is built up and handed back;
otherwise an error is returned. The parser and lexer are
implemented using the well-known Unix tools <application>yacc</>
and <application>lex</>.

The <firstterm>lexer</firstterm> is defined in the file
<filename>scan.l</filename> and is responsible
for recognizing <firstterm>identifiers</firstterm>,
the <firstterm>SQL key words</firstterm> etc. For
every key word or identifier that is found, a <firstterm>token</firstterm>
is generated and handed to the parser.

The parser is defined in the file <filename>gram.y</filename> and
consists of a set of <firstterm>grammar rules</firstterm> and
<firstterm>actions</firstterm> that are executed whenever a rule
is fired. The code of the actions (which is actually C code) is
used to build up the parse tree.
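The division of labor between lexer and parser can be illustrated with a toy example. The grammar below is hypothetical and covers only <literal>SELECT ident, ... FROM ident</literal>; the real lexer and grammar in <filename>scan.l</filename> and <filename>gram.y</filename> are vastly richer, and their actions build C node structs rather than Python dictionaries.

```python
# Toy lexer/parser split: lex() turns the query string into tokens,
# parse() fires grammar rules whose "actions" build a parse-tree node.
import re

TOKEN_RE = re.compile(
    r"\s*(?:(?P<KEYWORD>SELECT|FROM)\b|(?P<IDENT>[a-z_][a-z0-9_]*)|(?P<COMMA>,))",
    re.I,
)

def lex(query):
    """Produce (token_type, value) pairs, as scan.l does for the parser."""
    tokens, pos = [], 0
    while pos < len(query):
        m = TOKEN_RE.match(query, pos)
        if not m:
            raise SyntaxError(f"bad input at {query[pos:]!r}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens

def parse(tokens):
    """Rule: SELECT target_list FROM ident. The 'action' returns a node."""
    pos = 0
    def expect(ttype):
        nonlocal pos
        if pos >= len(tokens) or tokens[pos][0] != ttype:
            raise SyntaxError(f"expected {ttype} at token {pos}")
        pos += 1
        return tokens[pos - 1][1]
    expect("KEYWORD")                        # SELECT
    targets = [expect("IDENT")]
    while pos < len(tokens) and tokens[pos][0] == "COMMA":
        pos += 1                             # target_list: ... ',' ident
        targets.append(expect("IDENT"))
    expect("KEYWORD")                        # FROM
    relation = expect("IDENT")
    return {"type": "SelectStmt", "targets": targets, "from": relation}
```

For example, `parse(lex("SELECT a, b FROM t"))` yields a node naming the target columns and the scanned relation.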
The file <filename>scan.l</filename> is transformed to the C
source file <filename>scan.c</filename> using the program
<application>lex</application> and <filename>gram.y</filename> is
transformed to <filename>gram.c</filename> using
<application>yacc</application>. After these transformations
have taken place a normal C compiler can be used to create the
parser. Never make any changes to the generated C files as they
will be overwritten the next time <application>lex</application>
or <application>yacc</application> is called.

The mentioned transformations and compilations are normally done
automatically using the <firstterm>makefiles</firstterm>
shipped with the <productname>PostgreSQL</productname>
distribution.
A detailed description of <application>yacc</application> or
the grammar rules given in <filename>gram.y</filename> would be
beyond the scope of this paper. There are many books and
documents dealing with <application>lex</application> and
<application>yacc</application>. You should be familiar with
<application>yacc</application> before you start to study the
grammar given in <filename>gram.y</filename>; otherwise you won't
understand what happens there.
<title>Transformation Process</title>

The parser stage creates a parse tree using only fixed rules about
the syntactic structure of SQL. It does not make any lookups in the
system catalogs, so there is no possibility to understand the detailed
semantics of the requested operations. After the parser completes,
the <firstterm>transformation process</firstterm> takes the tree handed
back by the parser as input and does the semantic interpretation needed
to understand which tables, functions, and operators are referenced by
the query. The data structure that is built to represent this
information is called the <firstterm>query tree</>.

The reason for separating raw parsing from semantic analysis is that
system catalog lookups can only be done within a transaction, and we
do not wish to start a transaction immediately upon receiving a query
string. The raw parsing stage is sufficient to identify the transaction
control commands (<command>BEGIN</>, <command>ROLLBACK</>, etc), and
these can then be correctly executed without any further analysis.
Once we know that we are dealing with an actual query (such as
<command>SELECT</> or <command>UPDATE</>), it is okay to
start a transaction if we're not already in one. Only then can the
transformation process be invoked.
The query tree created by the transformation process is structurally
similar to the raw parse tree in most places, but it has many differences
in detail. For example, a <structname>FuncCall</> node in the
parse tree represents something that looks syntactically like a function
call. This may be transformed to either a <structname>FuncExpr</>
or <structname>Aggref</> node depending on whether the referenced
name turns out to be an ordinary function or an aggregate function.
Also, information about the actual data types of columns and expression
results is added to the query tree.
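The <structname>FuncCall</> disambiguation can be sketched like this. The dictionary-based nodes and the hard-coded aggregate set are stand-ins for illustration; the backend works on C node structs and decides by consulting the system catalogs.

```python
# Sketch: a FuncCall node becomes an Aggref if the name is an aggregate,
# otherwise a FuncExpr. AGGREGATES stands in for a catalog lookup.
AGGREGATES = {"sum", "count", "avg"}

def transform_func_call(node):
    assert node["type"] == "FuncCall"
    kind = "Aggref" if node["name"] in AGGREGATES else "FuncExpr"
    return {"type": kind, "name": node["name"], "args": node["args"]}
```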
<sect1 id="rule-system">

<title>The <productname>PostgreSQL</productname> Rule System</title>

<productname>PostgreSQL</productname> supports a powerful
<firstterm>rule system</firstterm> for the specification
of <firstterm>views</firstterm> and ambiguous <firstterm>view updates</firstterm>.
Originally the <productname>PostgreSQL</productname>
rule system consisted of two implementations:

The first one worked using <firstterm>row level</firstterm> processing and was
implemented deep in the <firstterm>executor</firstterm>. The rule system was
called whenever an individual row had been accessed. This
implementation was removed in 1995 when the last official release
of the <productname>Berkeley Postgres</productname> project was
transformed into <productname>Postgres95</productname>.

The second implementation of the rule system is a technique
called <firstterm>query rewriting</firstterm>.
The <firstterm>rewrite system</firstterm> is a module
that exists between the <firstterm>parser stage</firstterm> and the
<firstterm>planner/optimizer</firstterm>. This technique is still implemented.

The query rewriter is discussed in some detail in
<xref linkend="rules">, so there is no need to cover it here.
We will only point out that both the input and the output of the
rewriter are query trees, that is, there is no change in the
representation or level of semantic detail in the trees. Rewriting
can be thought of as a form of macro expansion.
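The macro-expansion character of view rewriting can be illustrated with a toy model. Everything here is simplified for the sketch: the real rewriter substitutes within query trees, not within the flat dictionaries used below, and view definitions come from the system catalogs rather than a hard-coded table.

```python
# Toy view expansion: a query over a view becomes a query over the view's
# base table, with the view's qualification merged into the query's own.
VIEWS = {
    # view name -> its defining "query" (illustrative structure)
    "active_users": {"from": "users", "where": "active"},
}

def rewrite(query):
    target = query["from"]
    if target not in VIEWS:
        return query                      # nothing to expand
    definition = VIEWS[target]
    # Macro expansion: splice in the base table and AND the qualifications.
    quals = [q for q in (query.get("where"), definition.get("where")) if q]
    return {"from": definition["from"], "where": " AND ".join(quals)}
```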
<sect1 id="planner-optimizer">

<title>Planner/Optimizer</title>

The task of the <firstterm>planner/optimizer</firstterm> is to
create an optimal execution plan. A given SQL query (and hence, a
query tree) can be actually executed in a wide variety of
different ways, each of which will produce the same set of
results. If it is computationally feasible, the query optimizer
will examine each of these possible execution plans, ultimately
selecting the execution plan that is expected to run the fastest.

In some situations, examining each possible way in which a query
can be executed would take an excessive amount of time and memory.
In particular, this occurs when executing queries
involving large numbers of join operations. In order to determine
a reasonable (not optimal) query plan in a reasonable amount of
time, <productname>PostgreSQL</productname> uses a <xref
linkend="geqo" endterm="geqo-title">.

The planner's search procedure actually works with data structures
called <firstterm>paths</>, which are simply cut-down representations of
plans containing only as much information as the planner needs to make
its decisions. After the cheapest path is determined, a full-fledged
<firstterm>plan tree</> is built to pass to the executor. This represents
the desired execution plan in sufficient detail for the executor to run it.
In the rest of this section we'll ignore the distinction between paths
and plans.
<title>Generating Possible Plans</title>

The planner/optimizer starts by generating plans for scanning each
individual relation (table) used in the query. The possible plans
are determined by the available indexes on each relation.
There is always the possibility of performing a
sequential scan on a relation, so a sequential scan plan is always
created. Assume an index is defined on a
relation (for example a B-tree index) and a query contains the
restriction <literal>relation.attribute OPR constant</literal>. If
<literal>relation.attribute</literal> happens to match the key of the B-tree
index and <literal>OPR</literal> is one of the operators listed in
the index's <firstterm>operator class</>, another plan is created using
the B-tree index to scan the relation. If there are further indexes
present and the restrictions in the query happen to match a key of an
index, further plans will be considered.
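The scan-path generation and the cost-based choice between paths can be sketched as follows. The cost formulas and the selectivity figure are invented for the sketch; PostgreSQL's real cost model (in <filename>costsize.c</filename>) is considerably more elaborate.

```python
# Sketch: build all candidate scan paths for one relation, then pick the
# cheapest. An index path is added only when a restriction matches the
# index key with an operator from the index's operator class.
def generate_scan_paths(relation, restrictions, indexes):
    # A sequential scan is always possible.
    paths = [{"kind": "SeqScan",
              "cost": relation["pages"] + relation["rows"] * 0.01}]
    for index in indexes:
        for attr, opr in restrictions:
            if attr == index["key"] and opr in index["operator_class"]:
                selectivity = 0.1      # assumed fraction of matching rows
                cost = index["pages"] + relation["rows"] * selectivity * 0.05
                paths.append({"kind": "IndexScan",
                              "index": index["name"], "cost": cost})
    return paths

def cheapest(paths):
    return min(paths, key=lambda p: p["cost"])
```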
After all feasible plans have been found for scanning single relations,
plans for joining relations are created. The planner/optimizer
preferentially considers joins between any two relations for which there
exists a corresponding join clause in the <literal>WHERE</literal> qualification (i.e. for
which a restriction like <literal>where rel1.attr1=rel2.attr2</literal>
exists). Join pairs with no join clause are considered only when there
is no other choice, that is, a particular relation has no available
join clauses to any other relation. All possible plans are generated for
every join pair considered
by the planner/optimizer. The three possible join strategies are:
<firstterm>nested loop join</firstterm>: The right relation is scanned
once for every row found in the left relation. This strategy
is easy to implement but can be very time consuming. (However,
if the right relation can be scanned with an index scan, this can
be a good strategy. It is possible to use values from the current
row of the left relation as keys for the index scan of the right.)

<firstterm>merge join</firstterm>: Each relation is sorted on the join
attributes before the join starts. Then the two relations are
scanned in parallel, and matching rows are combined to form
join rows. This kind of join is more
attractive because each relation has to be scanned only once.
The required sorting may be achieved either by an explicit sort
step, or by scanning the relation in the proper order using an
index on the join key.

<firstterm>hash join</firstterm>: The right relation is first scanned
and loaded into a hash table, using its join attributes as hash keys.
Next the left relation is scanned and the
appropriate values of every row found are used as hash keys to
locate the matching rows in the table.
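The three strategies above can be sketched for the simple case of an equi-join on a single attribute. These are deliberately minimal models: the executor's real implementations are in C and handle arbitrary join clauses, outer joins, memory batching, and much more.

```python
# Minimal equi-join sketches. Rows are dicts; lattr/rattr name the join
# column on the left and right input respectively.
def nested_loop_join(left, right, lattr, rattr):
    # Rescan the right relation once per left row.
    return [{**l, **r} for l in left for r in right if l[lattr] == r[rattr]]

def merge_join(left, right, lattr, rattr):
    # Sort both inputs on the join attribute, then scan them in parallel.
    left = sorted(left, key=lambda r: r[lattr])
    right = sorted(right, key=lambda r: r[rattr])
    out, j = [], 0
    for l in left:
        while j < len(right) and right[j][rattr] < l[lattr]:
            j += 1
        k = j                     # re-scan the run of equal right rows
        while k < len(right) and right[k][rattr] == l[lattr]:
            out.append({**l, **right[k]})
            k += 1
    return out

def hash_join(left, right, lattr, rattr):
    # Build a hash table on the right input, then probe with left rows.
    table = {}
    for r in right:
        table.setdefault(r[rattr], []).append(r)
    return [{**l, **r} for l in left for r in table.get(l[lattr], [])]
```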
When the query involves more than two relations, the final result
must be built up by a tree of join steps, each with two inputs.
The planner examines different possible join sequences to find the
cheapest one.

The finished plan tree consists of sequential or index scans of
the base relations, plus nested-loop, merge, or hash join nodes as
needed, plus any auxiliary steps needed, such as sort nodes or
aggregate-function calculation nodes. Most of these plan node
types have the additional ability to do <firstterm>selection</>
(discarding rows that do not meet a specified boolean condition)
and <firstterm>projection</> (computation of a derived column set
based on given column values, that is, evaluation of scalar
expressions where needed). One of the responsibilities of the
planner is to attach selection conditions from the
<literal>WHERE</literal> clause and computation of required
output expressions to the most appropriate nodes of the plan
tree.
<sect1 id="executor">

<title>Executor</title>

The <firstterm>executor</firstterm> takes the plan handed back by the
planner/optimizer and recursively processes it to extract the required set
of rows. This is essentially a demand-pull pipeline mechanism.
Each time a plan node is called, it must deliver one more row, or
report that it is done delivering rows.

To provide a concrete example, assume that the top
node is a <literal>MergeJoin</literal> node.
Before any merge can be done two rows have to be fetched (one from
each subplan). So the executor recursively calls itself to
process the subplans (it starts with the subplan attached to
<literal>lefttree</literal>). The new top node (the top node of the left
subplan) is, let's say, a
<literal>Sort</literal> node and again recursion is needed to obtain
an input row. The child node of the <literal>Sort</literal> might
be a <literal>SeqScan</> node, representing actual reading of a table.
Execution of this node causes the executor to fetch a row from the
table and return it up to the calling node. The <literal>Sort</literal>
node will repeatedly call its child to obtain all the rows to be sorted.
When the input is exhausted (as indicated by the child node returning
a NULL instead of a row), the <literal>Sort</literal> code performs
the sort, and finally is able to return its first output row, namely
the first one in sorted order. It keeps the remaining rows stored so
that it can deliver them in sorted order in response to later demands.
The <literal>MergeJoin</literal> node similarly demands the first row
from its right subplan. Then it compares the two rows to see if they
can be joined; if so, it returns a join row to its caller. On the next
call, or immediately if it cannot join the current pair of inputs,
it advances to the next row of one table
or the other (depending on how the comparison came out), and again
checks for a match. Eventually, one subplan or the other is exhausted,
and the <literal>MergeJoin</literal> node returns NULL to indicate that
no more join rows can be formed.

Complex queries may involve many levels of plan nodes, but the general
approach is the same: each node computes and returns its next output
row each time it is called. Each node is also responsible for applying
any selection or projection expressions that were assigned to it by
the planner.
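The demand-pull protocol described above can be sketched with two node types: each node's <literal>next()</literal> returns one row, or <literal>None</literal> when exhausted (mirroring the NULL convention). This is an illustrative model only; the real executor uses C plan-state nodes and spills large sorts to disk.

```python
# Demand-pull sketch: Sort drains its child on first demand, then hands
# out rows one at a time, exactly as described for the Sort node above.
class SeqScan:
    def __init__(self, table):
        self.rows = iter(table)        # stands in for reading a table
    def next(self):
        return next(self.rows, None)   # None plays the role of NULL

class Sort:
    def __init__(self, child, key):
        self.child, self.key = child, key
        self.sorted = None
    def next(self):
        if self.sorted is None:
            # First call: pull every row from the child, then sort.
            rows = []
            while (row := self.child.next()) is not None:
                rows.append(row)
            self.sorted = iter(sorted(rows, key=self.key))
        return next(self.sorted, None)

def run(plan):
    # Top level: demand rows until the plan reports it is done.
    out = []
    while (row := plan.next()) is not None:
        out.append(row)
    return out
```

Running `run(Sort(SeqScan([3, 1, 2]), key=lambda r: r))` pulls the three rows through the pipeline and delivers them in sorted order.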
The executor mechanism is used to evaluate all four basic SQL query types:
<command>SELECT</>, <command>INSERT</>, <command>UPDATE</>, and
<command>DELETE</>. For <command>SELECT</>, the top-level executor
code only needs to send each row returned by the query plan tree off
to the client. For <command>INSERT</>, each returned row is inserted
into the target table specified for the <command>INSERT</>. (A simple
<command>INSERT ... VALUES</> command creates a trivial plan tree
consisting of a single <literal>Result</> node, which computes just one
result row. But <command>INSERT ... SELECT</> may demand the full power
of the executor mechanism.) For <command>UPDATE</>, the planner arranges
that each computed row includes all the updated column values, plus
the <firstterm>TID</> (tuple ID, or row ID) of the original target row;
the executor top level uses this information to create a new updated row
and mark the old row deleted. For <command>DELETE</>, the only column
that is actually returned by the plan is the TID, and the executor top
level simply uses the TID to visit each target row and mark it deleted.
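The TID-based top-level handling of <command>UPDATE</> and <command>DELETE</> can be sketched as follows. The list-of-dicts table and the field names are invented for the sketch; real heap updates additionally involve MVCC visibility, locking, and write-ahead logging.

```python
# Sketch: plan rows for UPDATE carry new column values plus the TID of the
# old row; DELETE plan rows carry only the TID. Here a table is a list and
# a TID is simply an index into it.
def apply_update(table, plan_rows):
    for row in plan_rows:
        tid = row.pop("tid")
        table[tid]["deleted"] = True                      # old version dead
        table.append({"values": row, "deleted": False})   # new version

def apply_delete(table, plan_rows):
    for row in plan_rows:
        table[row["tid"]]["deleted"] = True
```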