1
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
5
>Row Estimation Examples</TITLE
8
CONTENT="Modular DocBook HTML Stylesheet Version 1.79"><LINK
10
HREF="mailto:pgsql-docs@postgresql.org"><LINK
12
TITLE="PostgreSQL 9.1beta1 Documentation"
13
HREF="index.html"><LINK
15
TITLE="How the Planner Uses Statistics"
16
HREF="planner-stats-details.html"><LINK
18
TITLE="How the Planner Uses Statistics"
19
HREF="planner-stats-details.html"><LINK
22
HREF="appendixes.html"><LINK
25
HREF="stylesheet.css"><META
26
HTTP-EQUIV="Content-Type"
27
CONTENT="text/html; charset=ISO-8859-1"><META
29
CONTENT="2011-04-27T21:20:33"></HEAD
35
SUMMARY="Header navigation table"
47
>PostgreSQL 9.1beta1 Documentation</A
56
TITLE="How the Planner Uses Statistics"
57
HREF="planner-stats-details.html"
66
TITLE="How the Planner Uses Statistics"
67
HREF="planner-stats-details.html"
74
>Chapter 57. How the Planner Uses Statistics</TD
80
TITLE="How the Planner Uses Statistics"
81
HREF="planner-stats-details.html"
90
HREF="appendixes.html"
104
NAME="ROW-ESTIMATION-EXAMPLES"
105
>57.1. Row Estimation Examples</A
108
> The examples shown below use tables in the <SPAN
112
regression test database.
113
The outputs shown are taken from version 8.3.
114
The behavior of earlier (or later) versions might vary.
115
Note also that since <TT
118
> uses random sampling
119
while producing statistics, the results will change slightly after
126
> Let's start with a very simple query:
129
CLASS="PROGRAMLISTING"
130
>EXPLAIN SELECT * FROM tenk1;
133
-------------------------------------------------------------
134
Seq Scan on tenk1 (cost=0.00..458.00 rows=10000 width=244)</PRE
137
How the planner determines the cardinality of <TT
142
HREF="planner-stats.html"
144
>, but is repeated here for
145
completeness. The number of pages and rows is looked up in
152
CLASS="PROGRAMLISTING"
153
>SELECT relpages, reltuples FROM pg_class WHERE relname = 'tenk1';
156
----------+-----------
160
These numbers are current as of the last <TT
167
> on the table. The planner then fetches the
168
actual current number of pages in the table (this is a cheap operation,
169
not requiring a table scan). If that is different from
177
> is scaled accordingly to
178
arrive at a current number-of-rows estimate. In this case the values
179
are correct so the rows estimate is the same as
186
> Let's move on to an example with a range condition in its
193
CLASS="PROGRAMLISTING"
194
>EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 1000;
197
--------------------------------------------------------------------------------
198
Bitmap Heap Scan on tenk1 (cost=24.06..394.64 rows=1007 width=244)
199
Recheck Cond: (unique1 < 1000)
200
-> Bitmap Index Scan on tenk1_unique1 (cost=0.00..23.80 rows=1007 width=0)
201
Index Cond: (unique1 < 1000)</PRE
204
The planner examines the <TT
208
and looks up the selectivity function for the operator
216
This is held in the column <TT
220
and the entry in this case is <CODE
227
> function retrieves the histogram for
235
>. For manual queries it is more
236
convenient to look in the simpler <TT
243
CLASS="PROGRAMLISTING"
244
>SELECT histogram_bounds FROM pg_stats
245
WHERE tablename='tenk1' AND attname='unique1';
248
------------------------------------------------------
249
{0,993,1997,3050,4040,5036,5957,7057,8029,9016,9995}</PRE
252
Next the fraction of the histogram occupied by <SPAN
256
is worked out. This is the selectivity. The histogram divides the range
257
into equal frequency buckets, so all we have to do is locate the bucket
258
that our value is in and count <SPAN
271
> of the ones before. The value 1000 is clearly in
272
the second bucket (993-1997). Assuming a linear distribution of
273
values inside each bucket, we can calculate the selectivity as:
276
CLASS="PROGRAMLISTING"
277
>selectivity = (1 + (1000 - bucket[2].min)/(bucket[2].max - bucket[2].min))/num_buckets
278
= (1 + (1000 - 993)/(1997 - 993))/10
282
that is, one whole bucket plus a linear fraction of the second, divided by
283
the number of buckets. The estimated number of rows can now be calculated as
284
the product of the selectivity and the cardinality of
291
CLASS="PROGRAMLISTING"
292
>rows = rel_cardinality * selectivity
294
= 1007 (rounding off)</PRE
298
> Next let's consider an example with an equality condition in its
305
CLASS="PROGRAMLISTING"
306
>EXPLAIN SELECT * FROM tenk1 WHERE stringu1 = 'CRAAAA';
309
----------------------------------------------------------
310
Seq Scan on tenk1 (cost=0.00..483.00 rows=30 width=244)
311
Filter: (stringu1 = 'CRAAAA'::name)</PRE
314
Again the planner examines the <TT
318
and looks up the selectivity function for <TT
325
>. For equality estimation the histogram is
326
not useful; instead the list of <I
333
>s) is used to determine the
334
selectivity. Let's have a look at the MCVs, with some additional columns
335
that will be useful later:
338
CLASS="PROGRAMLISTING"
339
>SELECT null_frac, n_distinct, most_common_vals, most_common_freqs FROM pg_stats
340
WHERE tablename='tenk1' AND attname='stringu1';
344
most_common_vals | {EJAAAA,BBAAAA,CRAAAA,FCAAAA,FEAAAA,GSAAAA,JOAAAA,MCAAAA,NAAAAA,WGAAAA}
345
most_common_freqs | {0.00333333,0.003,0.003,0.003,0.003,0.003,0.003,0.003,0.003,0.003} </PRE
351
> appears in the list of MCVs, the selectivity is
352
merely the corresponding entry in the list of most common frequencies
359
CLASS="PROGRAMLISTING"
360
>selectivity = mcf[3]
364
As before, the estimated number of rows is just the product of this with the
371
CLASS="PROGRAMLISTING"
372
>rows = 10000 * 0.003
377
> Now consider the same query, but with a constant that is not in the
384
CLASS="PROGRAMLISTING"
385
>EXPLAIN SELECT * FROM tenk1 WHERE stringu1 = 'xxx';
388
----------------------------------------------------------
389
Seq Scan on tenk1 (cost=0.00..483.00 rows=15 width=244)
390
Filter: (stringu1 = 'xxx'::name)</PRE
393
This is quite a different problem: how to estimate the selectivity when the
404
The approach is to use the fact that the value is not in the list,
405
combined with the knowledge of the frequencies for all of the
412
CLASS="PROGRAMLISTING"
413
>selectivity = (1 - sum(mvf))/(num_distinct - num_mcv)
414
= (1 - (0.00333333 + 0.003 + 0.003 + 0.003 + 0.003 + 0.003 +
415
0.003 + 0.003 + 0.003 + 0.003))/(676 - 10)
419
That is, add up all the frequencies for the <ACRONYM
423
subtract them from one, then
424
divide by the number of <SPAN
431
This amounts to assuming that the fraction of the column that is not any
432
of the MCVs is evenly distributed among all the other distinct values.
433
Notice that there are no null values so we don't have to worry about those
434
(otherwise we'd subtract the null fraction from the numerator as well).
435
The estimated number of rows is then calculated as usual:
438
CLASS="PROGRAMLISTING"
439
>rows = 10000 * 0.0014559
440
= 15 (rounding off)</PRE
444
> The previous example with <TT
446
>unique1 < 1000</TT
448
oversimplification of what <CODE
452
now that we have seen an example of the use of MCVs, we can fill in some
453
more detail. The example was correct as far as it went, because since
457
> is a unique column it has no MCVs (obviously, no
458
value is any more common than any other value). For a non-unique
459
column, there will normally be both a histogram and an MCV list, and
464
>the histogram does not include the portion of the column
465
population represented by the MCVs</I
467
>. We do things this way because
468
it allows more precise estimation. In this situation
472
> directly applies the condition (e.g.,
476
>) to each value of the MCV list, and adds up the
477
frequencies of the MCVs for which the condition is true. This gives
478
an exact estimate of the selectivity within the portion of the table
479
that is MCVs. The histogram is then used in the same way as above
480
to estimate the selectivity in the portion of the table that is not
481
MCVs, and then the two numbers are combined to estimate the overall
482
selectivity. For example, consider
485
CLASS="PROGRAMLISTING"
486
>EXPLAIN SELECT * FROM tenk1 WHERE stringu1 < 'IAAAAA';
489
------------------------------------------------------------
490
Seq Scan on tenk1 (cost=0.00..483.00 rows=3077 width=244)
491
Filter: (stringu1 < 'IAAAAA'::name)</PRE
494
We already saw the MCV information for <TT
498
and here is its histogram:
501
CLASS="PROGRAMLISTING"
502
>SELECT histogram_bounds FROM pg_stats
503
WHERE tablename='tenk1' AND attname='stringu1';
506
--------------------------------------------------------------------------------
507
{AAAAAA,CQAAAA,FRAAAA,IBAAAA,KRAAAA,NFAAAA,PSAAAA,SGAAAA,VAAAAA,XLAAAA,ZZAAAA}</PRE
510
Checking the MCV list, we find that the condition <TT
514
> is satisfied by the first six entries and not the last four,
515
so the selectivity within the MCV part of the population is
518
CLASS="PROGRAMLISTING"
519
>selectivity = sum(relevant mvfs)
520
= 0.00333333 + 0.003 + 0.003 + 0.003 + 0.003 + 0.003
524
Summing all the MCFs also tells us that the total fraction of the
525
population represented by MCVs is 0.03033333, and therefore the
526
fraction represented by the histogram is 0.96966667 (again, there
527
are no nulls, else we'd have to exclude them here). We can see
531
> falls nearly at the end of the
532
third histogram bucket. Using some rather cheesy assumptions
533
about the frequency of different characters, the planner arrives
534
at the estimate 0.298387 for the portion of the histogram population
535
that is less than <TT
538
>. We then combine the estimates
539
for the MCV and non-MCV populations:
542
CLASS="PROGRAMLISTING"
543
>selectivity = mcv_selectivity + histogram_selectivity * histogram_fraction
544
= 0.01833333 + 0.298387 * 0.96966667
547
rows = 10000 * 0.307669
548
= 3077 (rounding off)</PRE
551
In this particular example, the correction from the MCV list is fairly
552
small, because the column distribution is actually quite flat (the
553
statistics showing these particular values as being more common than
554
others are mostly due to sampling error). In a more typical case where
555
some values are significantly more common than others, this complicated
556
process gives a useful improvement in accuracy because the selectivity
557
for the most common values is found exactly.
560
> Now let's consider a case with more than one
567
CLASS="PROGRAMLISTING"
568
>EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 1000 AND stringu1 = 'xxx';
571
--------------------------------------------------------------------------------
572
Bitmap Heap Scan on tenk1 (cost=23.80..396.91 rows=1 width=244)
573
Recheck Cond: (unique1 < 1000)
574
Filter: (stringu1 = 'xxx'::name)
575
-> Bitmap Index Scan on tenk1_unique1 (cost=0.00..23.80 rows=1007 width=0)
576
Index Cond: (unique1 < 1000)</PRE
579
The planner assumes that the two conditions are independent, so that
580
the individual selectivities of the clauses can be multiplied together:
583
CLASS="PROGRAMLISTING"
584
>selectivity = selectivity(unique1 < 1000) * selectivity(stringu1 = 'xxx')
585
= 0.100697 * 0.0014559
588
rows = 10000 * 0.0001466
589
= 1 (rounding off)</PRE
592
Notice that the number of rows estimated to be returned from the bitmap
593
index scan reflects only the condition used with the index; this is
594
important since it affects the cost estimate for the subsequent heap
598
> Finally we will examine a query that involves a join:
601
CLASS="PROGRAMLISTING"
602
>EXPLAIN SELECT * FROM tenk1 t1, tenk2 t2
603
WHERE t1.unique1 < 50 AND t1.unique2 = t2.unique2;
606
--------------------------------------------------------------------------------------
607
Nested Loop (cost=4.64..456.23 rows=50 width=488)
608
-> Bitmap Heap Scan on tenk1 t1 (cost=4.64..142.17 rows=50 width=244)
609
Recheck Cond: (unique1 < 50)
610
-> Bitmap Index Scan on tenk1_unique1 (cost=0.00..4.63 rows=50 width=0)
611
Index Cond: (unique1 < 50)
612
-> Index Scan using tenk2_unique2 on tenk2 t2 (cost=0.00..6.27 rows=1 width=244)
613
Index Cond: (unique2 = t1.unique2)</PRE
616
The restriction on <TT
624
is evaluated before the nested-loop join.
625
This is handled analogously to the previous range example. This time the
626
value 50 falls into the first bucket of the
633
CLASS="PROGRAMLISTING"
634
>selectivity = (0 + (50 - bucket[1].min)/(bucket[1].max - bucket[1].min))/num_buckets
635
= (0 + (50 - 0)/(993 - 0))/10
638
rows = 10000 * 0.005035
639
= 50 (rounding off)</PRE
642
The restriction for the join is <TT
644
>t2.unique2 = t1.unique2</TT
650
>, however the selectivity function is
651
obtained from the <TT
665
> looks up the statistical information for both
675
CLASS="PROGRAMLISTING"
676
>SELECT tablename, null_frac,n_distinct, most_common_vals FROM pg_stats
677
WHERE tablename IN ('tenk1', 'tenk2') AND attname='unique2';
679
tablename | null_frac | n_distinct | most_common_vals
680
-----------+-----------+------------+------------------
682
tenk2 | 0 | -1 |</PRE
685
In this case there is no <ACRONYM
692
> because all the values appear to be
693
unique, so we use an algorithm that relies only on the number of
694
distinct values for both relations together with their null fractions:
697
CLASS="PROGRAMLISTING"
698
>selectivity = (1 - null_frac1) * (1 - null_frac2) * min(1/num_distinct1, 1/num_distinct2)
699
= (1 - 0) * (1 - 0) / max(10000, 10000)
703
This is, subtract the null fraction from one for each of the relations,
704
and divide by the maximum of the numbers of distinct values.
706
that the join is likely to emit is calculated as the cardinality of the
707
Cartesian product of the two inputs, multiplied by the
711
CLASS="PROGRAMLISTING"
712
>rows = (outer_cardinality * inner_cardinality) * selectivity
713
= (50 * 10000) * 0.0001
718
> Had there been MCV lists for the two columns,
722
> would have used direct comparison of the MCV
723
lists to determine the join selectivity within the part of the column
724
populations represented by the MCVs. The estimate for the remainder of the
725
populations follows the same approach shown here.
728
> Notice that we showed <TT
730
>inner_cardinality</TT
732
the unmodified size of <TT
735
>. It might appear from
736
inspection of the <TT
739
> output that the estimate of
740
join rows comes from 50 * 1, that is, the number of outer rows times
741
the estimated number of rows obtained by each inner index scan on
745
>. But this is not the case: the join relation size
746
is estimated before any particular join plan has been considered. If
747
everything is working well then the two ways of estimating the join
748
size will produce about the same answer, but due to roundoff error and
749
other factors they sometimes diverge significantly.
752
> For those interested in further details, estimation of the size of
753
a table (before any <TT
756
> clauses) is done in
759
>src/backend/optimizer/util/plancat.c</TT
761
logic for clause selectivities is in
764
>src/backend/optimizer/path/clausesel.c</TT
766
operator-specific selectivity functions are mostly found
769
>src/backend/utils/adt/selfuncs.c</TT
778
SUMMARY="Footer navigation table"
789
HREF="planner-stats-details.html"
807
HREF="appendixes.html"
817
>How the Planner Uses Statistics</TD
823
HREF="planner-stats-details.html"
b'\\ No newline at end of file'