1
<!-- doc/src/sgml/geqo.sgml -->
6
<firstname>Martin</firstname>
7
<surname>Utesch</surname>
10
University of Mining and Technology
13
Institute of Automatic Control
25
<date>1997-10-02</date>
28
<title>Genetic Query Optimizer</title>
34
Written by Martin Utesch (<email>utesch@aut.tu-freiberg.de</email>)
35
for the Institute of Automatic Control at the University of Mining and Technology in Freiberg, Germany.
40
<sect1 id="geqo-intro">
41
<title>Query Handling as a Complex Optimization Problem</title>
44
Among all relational operators the most difficult one to process
45
and optimize is the <firstterm>join</firstterm>. The number of
46
possible query plans grows exponentially with the
47
number of joins in the query. Further optimization effort is
48
caused by the support of a variety of <firstterm>join
49
methods</firstterm> (e.g., nested loop, hash join, merge join in
50
<productname>PostgreSQL</productname>) to process individual joins
51
and a diversity of <firstterm>indexes</firstterm> (e.g.,
52
B-tree, hash, GiST and GIN in <productname>PostgreSQL</productname>) as
53
access paths for relations.
57
The normal <productname>PostgreSQL</productname> query optimizer
58
performs a <firstterm>near-exhaustive search</firstterm> over the
59
space of alternative strategies. This algorithm, first introduced
60
in IBM's System R database, produces a near-optimal join order,
61
but can take an enormous amount of time and memory space when the
62
number of joins in the query grows large. This makes the ordinary
63
<productname>PostgreSQL</productname> query optimizer
64
inappropriate for queries that join a large number of tables.
68
The Institute of Automatic Control at the University of Mining and
69
Technology, in Freiberg, Germany, encountered some problems when
70
it wanted to use <productname>PostgreSQL</productname> as the
71
backend for a decision support knowledge based system for the
72
maintenance of an electrical power grid. The DBMS needed to handle
73
large join queries for the inference machine of the knowledge
74
based system. The number of joins in these queries made using the
75
normal query optimizer infeasible.
79
In the following we describe the implementation of a
80
<firstterm>genetic algorithm</firstterm> to solve the join
81
ordering problem in a manner that is efficient for queries
82
involving large numbers of joins.
86
<sect1 id="geqo-intro2">
87
<title>Genetic Algorithms</title>
90
The genetic algorithm (<acronym>GA</acronym>) is a heuristic optimization method which
91
operates through randomized search. The set of possible solutions for the
92
optimization problem is considered as a
93
<firstterm>population</firstterm> of <firstterm>individuals</firstterm>.
94
The degree of adaptation of an individual to its environment is specified
95
by its <firstterm>fitness</firstterm>.
99
The coordinates of an individual in the search space are represented
100
by <firstterm>chromosomes</firstterm>, in essence a set of character
101
strings. A <firstterm>gene</firstterm> is a
102
subsection of a chromosome which encodes the value of a single parameter
103
being optimized. Typical encodings for a gene could be <firstterm>binary</firstterm> or
104
<firstterm>integer</firstterm>.
108
Through simulation of the evolutionary operations <firstterm>recombination</firstterm>,
109
<firstterm>mutation</firstterm>, and
110
<firstterm>selection</firstterm> new generations of search points are found
111
that show a higher average fitness than their ancestors.
115
According to the <systemitem class="resource">comp.ai.genetic</> <acronym>FAQ</acronym> it cannot be stressed too
116
strongly that a <acronym>GA</acronym> is not a pure random search for a solution to a
117
problem. A <acronym>GA</acronym> uses stochastic processes, but the result is distinctly
118
non-random (better than random).
121
<figure id="geqo-diagram">
122
<title>Structured Diagram of a Genetic Algorithm</title>
124
<informaltable frame="none">
129
<entry>generation of ancestors at a time t</entry>
133
<entry>P''(t)</entry>
134
<entry>generation of descendants at a time t</entry>
140
<literallayout class="monospaced">
141
+=========================================+
142
|>>>>>>>>>>> Algorithm GA <<<<<<<<<<<<<<|
143
+=========================================+
144
| INITIALIZE t := 0 |
145
+=========================================+
147
+=========================================+
148
| evaluate FITNESS of P(t) |
149
+=========================================+
150
| while not STOPPING CRITERION do |
151
| +-------------------------------------+
152
| | P'(t) := RECOMBINATION{P(t)} |
153
| +-------------------------------------+
154
| | P''(t) := MUTATION{P'(t)} |
155
| +-------------------------------------+
156
| | P(t+1) := SELECTION{P''(t) + P(t)} |
157
| +-------------------------------------+
158
| | evaluate FITNESS of P''(t) |
159
| +-------------------------------------+
161
+===+=====================================+
166
<sect1 id="geqo-pg-intro">
167
<title>Genetic Query Optimization (<acronym>GEQO</acronym>) in PostgreSQL</title>
170
The <acronym>GEQO</acronym> module approaches the query
171
optimization problem as though it were the well-known traveling salesman
172
problem (<acronym>TSP</acronym>).
173
Possible query plans are encoded as integer strings. Each string
174
represents the join order from one relation of the query to the next.
175
For example, the join tree
176
<literallayout class="monospaced">
182
is encoded by the integer string '4-1-3-2',
183
which means, first join relation '4' and '1', then '3', and
184
then '2', where 1, 2, 3, 4 are relation IDs within the
185
<productname>PostgreSQL</productname> optimizer.
189
Specific characteristics of the <acronym>GEQO</acronym>
190
implementation in <productname>PostgreSQL</productname>
193
<itemizedlist spacing="compact" mark="bullet">
196
Usage of a <firstterm>steady state</firstterm> <acronym>GA</acronym> (replacement of the least fit
197
individuals in a population, not whole-generational replacement)
198
allows fast convergence towards improved query plans. This is
199
essential for query handling with reasonable time;
205
Usage of <firstterm>edge recombination crossover</firstterm>
206
which is especially suited to keep edge losses low for the
207
solution of the <acronym>TSP</acronym> by means of a
208
<acronym>GA</acronym>;
214
Mutation as genetic operator is deprecated so that no repair
215
mechanisms are needed to generate legal <acronym>TSP</acronym> tours.
222
Parts of the <acronym>GEQO</acronym> module are adapted from D. Whitley's
227
The <acronym>GEQO</acronym> module allows
228
the <productname>PostgreSQL</productname> query optimizer to
229
support large join queries effectively through
230
non-exhaustive search.
234
<title>Generating Possible Plans with <acronym>GEQO</acronym></title>
237
The <acronym>GEQO</acronym> planning process uses the standard planner
238
code to generate plans for scans of individual relations. Then join
239
plans are developed using the genetic approach. As shown above, each
240
candidate join plan is represented by a sequence in which to join
241
the base relations. In the initial stage, the <acronym>GEQO</acronym>
242
code simply generates some possible join sequences at random. For each
243
join sequence considered, the standard planner code is invoked to
244
estimate the cost of performing the query using that join sequence.
245
(For each step of the join sequence, all three possible join strategies
246
are considered; and all the initially-determined relation scan plans
247
are available. The estimated cost is the cheapest of these
248
possibilities.) Join sequences with lower estimated cost are considered
249
<quote>more fit</> than those with higher cost. The genetic algorithm
250
discards the least fit candidates. Then new candidates are generated
251
by combining genes of more-fit candidates — that is, by using
252
randomly-chosen portions of known low-cost join sequences to create
253
new sequences for consideration. This process is repeated until a
254
preset number of join sequences have been considered; then the best
255
one found at any time during the search is used to generate the finished
260
This process is inherently nondeterministic, because of the randomized
261
choices made during both the initial population selection and subsequent
262
<quote>mutation</> of the best candidates. To avoid surprising changes
263
of the selected plan, each run of the GEQO algorithm restarts its
264
random number generator with the current <xref linkend="guc-geqo-seed">
265
parameter setting. As long as <varname>geqo_seed</> and the other
266
GEQO parameters are kept fixed, the same plan will be generated for a
267
given query (and other planner inputs such as statistics). To experiment
268
with different search paths, try changing <varname>geqo_seed</>.
273
<sect2 id="geqo-future">
274
<title>Future Implementation Tasks for
275
<productname>PostgreSQL</> <acronym>GEQO</acronym></title>
278
Work is still needed to improve the genetic algorithm parameter
280
In file <filename>src/backend/optimizer/geqo/geqo_main.c</filename>,
282
<function>gimme_pool_size</function> and <function>gimme_number_generations</function>,
283
we have to find a compromise for the parameter settings
284
to satisfy two competing demands:
285
<itemizedlist spacing="compact">
288
Optimality of the query plan
300
In the current implementation, the fitness of each candidate join
301
sequence is estimated by running the standard planner's join selection
302
and cost estimation code from scratch. To the extent that different
303
candidates use similar sub-sequences of joins, a great deal of work
304
will be repeated. This could be made significantly faster by retaining
305
cost estimates for sub-joins. The problem is to avoid expending
306
unreasonable amounts of memory on retaining that state.
310
At a more basic level, it is not clear that solving query optimization
311
with a GA algorithm designed for TSP is appropriate. In the TSP case,
312
the cost associated with any substring (partial tour) is independent
313
of the rest of the tour, but this is certainly not true for query
314
optimization. Thus it is questionable whether edge recombination
315
crossover is the most effective mutation procedure.
321
<sect1 id="geqo-biblio">
322
<title>Further Reading</title>
325
The following resources contain additional information about
331
<ulink url="http://www.aip.de/~ast/EvolCompFAQ/">
332
The Hitch-Hiker's Guide to Evolutionary Computation</ulink>, (FAQ for <ulink
333
url="news://comp.ai.genetic"></ulink>)
339
<ulink url="http://www.red3d.com/cwr/evolve.html">
340
Evolutionary Computation and its application to art and design</ulink>, by
347
<xref linkend="ELMA04">
353
<xref linkend="FONG">