* We use the Lossy Counting (aka LC) algorithm proposed in the paper
* "Approximate frequency counts over data streams" by G. S. Manku and
* R. Motwani (VLDB 2002; http://www.vldb.org/conf/2002/S10P03.pdf)
*
* The Lossy Counting (aka LC) algorithm goes like this:
* Let s be the threshold frequency for an item (the minimum frequency we
* are interested in) and epsilon the error margin for the frequency. Let D
* be a set of triples (e, f, delta), where e is an element value, f is that
* element's frequency (actually, its current occurrence count) and delta is
* the maximum error in f. We start with D empty and process the elements in
* batches of size w. (The batch size is also known as "bucket size" and is
* equal to 1/epsilon.) Let the current batch number be b_current, starting
* with 1. For each element e we either increment its f count, if it's
* already in D, or insert a new triple into D with values (e, 1, b_current
* - 1). After processing each batch we prune D, by removing from it all
* elements with f + delta <= b_current. After the algorithm finishes we
* suppress all elements from D that do not satisfy f >= (s - epsilon) * N,
* where N is the total number of elements in the input. We emit the
* remaining elements with estimated frequency f/N. The LC paper proves
* that this algorithm finds all elements with true frequency at least s,
* and that no frequency is overestimated or is underestimated by more than
* epsilon. Furthermore, given reasonable assumptions about the input
* distribution, the required table size is no more than about 7 times w.
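*
* A small worked trace (an illustrative example, not from the LC paper):
* take w = 3 and the input "a a b | a c c". Batch 1 (b_current = 1)
* yields (a, 2, 0) and (b, 1, 0); pruning removes b, since 1 + 0 <= 1.
* Batch 2 (b_current = 2) yields (a, 3, 0) and (c, 2, 1); pruning keeps
* both, since 3 + 0 > 2 and 2 + 1 > 2. With N = 6 we would report a with
* estimated frequency 3/6 and c with 2/6, both exact for this input.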
*
* We set s to be the estimated frequency of the K'th word in a natural
* language's frequency table, where K is the target number of entries in
* the MCELEM array plus an arbitrary constant, meant to reflect the fact
* that the most common words in any language would usually be stopwords,
* so we will not actually see them in the input. We assume that the
* distribution of word frequencies (including the stopwords) follows Zipf's
* law with an exponent of 1.
*
* Assuming Zipfian distribution, the frequency of the K'th word is equal
* to 1/(K * H(W)), where H(n) is 1 + 1/2 + 1/3 + ... + 1/n and W is the
* number of words in the language. Putting W as one million, we get
* roughly 0.07/K. Assuming the top 10 words are stopwords gives
* s = 0.07/(K + 10). We set epsilon = s/10, which gives bucket width
* w = (K + 10)/0.007 and a maximum expected hashtable size of about
* 1000 * (K + 10).
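*
* Plugging in numbers as a check: for K = 100, s = 0.07/110 is about
* 0.00064 and epsilon = s/10 is about 0.000064, so w = 110/0.007 is about
* 15714, and the expected hashtable size tops out near 7 * 15714, which is
* about 110000 = 1000 * (100 + 10).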
*
* Note: in the above discussion, s, epsilon, and f/N are in terms of a
* lexeme's frequency as a fraction of all lexemes seen in the input.
* However, what we actually want to store in the finished pg_statistic
* entry is each lexeme's frequency as a fraction of all rows that it occurs
* in. Assuming that the input tsvectors are correctly constructed, no
* lexeme occurs more than once per tsvector, so the final count f is a
* correct estimate of the number of input tsvectors it occurs in, and we
* need only change the divisor from N to nonnull_cnt to get the number we
* want.
*/
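
/*
 * To make the algorithm above concrete, here is a minimal, self-contained
 * sketch of the LC loop over a plain array.  It is illustrative only: the
 * names (lc_item, lc_process_batch, LC_MAX_ITEMS) are hypothetical, and
 * the real code below uses a hashtable for D instead of an array.
 */
#ifdef LC_EXAMPLE
#define LC_MAX_ITEMS 1024

typedef struct lc_item
{
	int			e;				/* element value */
	int			f;				/* occurrence count */
	int			delta;			/* maximum error in f */
} lc_item;

static lc_item lc_d[LC_MAX_ITEMS];	/* the set D */
static int	lc_n = 0;			/* current size of D */

/* Process one batch of w elements; b_current is the 1-based batch number */
static void
lc_process_batch(const int *batch, int w, int b_current)
{
	int			i,
				j;

	for (i = 0; i < w; i++)
	{
		for (j = 0; j < lc_n; j++)
		{
			if (lc_d[j].e == batch[i])
			{
				lc_d[j].f++;	/* already in D: bump the count */
				break;
			}
		}
		if (j == lc_n && lc_n < LC_MAX_ITEMS)
		{
			/* not in D: insert (e, 1, b_current - 1) */
			lc_d[lc_n].e = batch[i];
			lc_d[lc_n].f = 1;
			lc_d[lc_n].delta = b_current - 1;
			lc_n++;
		}
	}

	/* prune D: drop every entry with f + delta <= b_current */
	for (j = 0; j < lc_n;)
	{
		if (lc_d[j].f + lc_d[j].delta <= b_current)
			lc_d[j] = lc_d[--lc_n];	/* replace with last entry */
		else
			j++;
	}
}
#endif							/* LC_EXAMPLE */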
static void
compute_tsvector_stats(VacAttrStats *stats,
					   AnalyzeAttrFetchFunc fetchfunc,
					   int samplerows,
					   double totalrows)
{
LexemeHashKey hash_key;
/*
 * We want statistics_target * 10 lexemes in the MCELEM array.  This
 * multiplier is pretty arbitrary, but is meant to reflect the fact that
 * the number of individual lexeme values tracked in pg_statistic ought
 * to be more than the number of values for a simple scalar column.
 */
num_mcelem = stats->attr->attstattarget * 10;
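
/*
 * For example, with the default statistics target of 100 this tracks up
 * to 1000 lexemes in the MCELEM array; the target can be raised per
 * column with ALTER TABLE ... ALTER COLUMN ... SET STATISTICS.
 */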
/*
 * We set bucket width equal to (num_mcelem + 10) / 0.007 as per the
 * comment above.
 */
bucket_width = (num_mcelem + 10) * 1000 / 7;
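
/*
 * Worked example: num_mcelem = 1000 gives bucket_width =
 * 1010 * 1000 / 7 = 144285, which is (num_mcelem + 10) / 0.007 rounded
 * down; the "* 1000 / 7" form just keeps the computation in integer
 * arithmetic.
 */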

/*
 * Create the hashtable. It will be in local memory, so we don't need to
 * worry about overflowing the initial size. Also we don't need to pay any
 * attention to locking and memory management.
 */
MemSet(&hash_ctl, 0, sizeof(hash_ctl));
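/*
 * The table is then created with PostgreSQL's dynahash API, roughly like
 * the following sketch (the initial-size figure and flag set here are
 * illustrative; lexeme_hash and lexeme_match are this file's hash and
 * comparison support functions):
 *
 *	hash_ctl.keysize = sizeof(LexemeHashKey);
 *	hash_ctl.entrysize = sizeof(TrackItem);
 *	hash_ctl.hash = lexeme_hash;
 *	hash_ctl.match = lexeme_match;
 *	lexemes_tab = hash_create("Analyzed lexemes table",
 *							  bucket_width * 7,
 *							  &hash_ctl,
 *							  HASH_ELEM | HASH_FUNCTION | HASH_COMPARE);
 */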
stats->stadistinct = -1.0;

/*
 * Construct an array of the interesting hashtable items, that is,
 * those meeting the cutoff frequency (s - epsilon)*N.  Also identify
 * the minimum and maximum frequencies among these items.
 *
 * Since epsilon = s/10 and bucket_width = 1/epsilon, we have
 * (s - epsilon)*N = 9*epsilon*N, so the cutoff frequency is
 * 9*N / bucket_width.
 */
cutoff_freq = 9 * lexeme_no / bucket_width;
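
/*
 * Worked example (illustrative numbers): with lexeme_no = 1000000 and
 * bucket_width = 144285, cutoff_freq = 9000000 / 144285 = 62, so only
 * lexemes seen more than 62 times can enter the MCELEM list.
 */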

i = hash_get_num_entries(lexemes_tab);	/* surely enough space */
sort_table = (TrackItem **) palloc(sizeof(TrackItem *) * i);
hash_seq_init(&scan_status, lexemes_tab);
track_len = 0;
minfreq = lexeme_no;
maxfreq = 0;
while ((item = (TrackItem *) hash_seq_search(&scan_status)) != NULL)
{
	if (item->frequency > cutoff_freq)
	{
		sort_table[track_len++] = item;
		minfreq = Min(minfreq, item->frequency);
		maxfreq = Max(maxfreq, item->frequency);
	}
}
Assert(track_len <= i);

/* emit some statistics for debug purposes */
elog(DEBUG3, "tsvector_stats: target # mces = %d, bucket width = %d, "
	 "# lexemes = %d, hashtable size = %d, usable entries = %d",
	 num_mcelem, bucket_width, lexeme_no, i, track_len);

/*
 * If we obtained more lexemes than we really want, get rid of those
 * with the least frequencies.  The easiest way is to qsort the array
 * into descending frequency order and truncate the array.
 */
if (num_mcelem < track_len)
{
	qsort(sort_table, track_len, sizeof(TrackItem *),
		  trackitem_compare_frequencies_desc);
	/* reset minfreq to the smallest frequency we're keeping */
	minfreq = sort_table[num_mcelem - 1]->frequency;
}
else
	num_mcelem = track_len;
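
/*
 * For reference, a comparator consistent with the qsort call above might
 * look like this (a sketch; the actual trackitem_compare_frequencies_desc
 * is presumed to be defined elsewhere in this file):
 *
 *	static int
 *	trackitem_compare_frequencies_desc(const void *e1, const void *e2)
 *	{
 *		const TrackItem *const *t1 = (const TrackItem *const *) e1;
 *		const TrackItem *const *t2 = (const TrackItem *const *) e2;
 *
 *		return (*t2)->frequency - (*t1)->frequency;
 *	}
 */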
/* Generate MCELEM slot entry */