---------+------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

name | -0.467008 | {"I- 580 Ramp","I- 880 Ramp","Sp Railroad ","I- 580 ","I- 680 Ramp","I- 80 Ramp","14th St ","5th St ","Mission Blvd","I- 880 "}
thepath | 20 | {"[(-122.089,37.71),(-122.0886,37.711)]"}
(2 rows)
</screen>
</para>

<para>
<structname>pg_stats</structname> is described in detail in
<xref linkend="view-pg-stats">.
</para>

<para>
The amount of information stored in <structname>pg_statistic</structname>,
in particular the maximum number of entries in the
<structfield>most_common_vals</> and <structfield>histogram_bounds</>
arrays for each column, can be set on a
column-by-column basis using the <command>ALTER TABLE SET STATISTICS</>
command, or globally by setting the
<xref linkend="guc-default-statistics-target"> configuration variable.
The default limit is presently 10 entries. Raising the limit
may allow more accurate planner estimates to be made, particularly for
columns with irregular data distributions, at the price of consuming
more space in <structname>pg_statistic</structname> and slightly more
time to compute the estimates. Conversely, a lower limit may be
appropriate for columns with simple data distributions.
</para>
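
<para>
As a minimal sketch of both approaches, assume a hypothetical table
<literal>mytable</> with a column <literal>mycol</>. Neither name comes
from this chapter, and the target value of 100 is only illustrative:
<programlisting>
-- Raise the statistics target for one column (hypothetical names).
ALTER TABLE mytable ALTER COLUMN mycol SET STATISTICS 100;

-- Or change the default target for the current session.
SET default_statistics_target = 100;

-- A new target takes effect the next time statistics are gathered.
ANALYZE mytable;
</programlisting>
</para>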

</sect1>

<title>Controlling the Planner with Explicit <literal>JOIN</> Clauses</title>

<secondary>controlling the order</secondary>
</indexterm>

<para>
It is possible to control the query planner to some extent by using
the explicit <literal>JOIN</> syntax. To see why this matters, we
first need some background.
</para>

<para>
In a simple join query, such as
<programlisting>
SELECT * FROM a, b, c WHERE a.id = b.id AND b.ref = c.id;
</programlisting>
the planner is free to join the given tables in any order. For
example, it could generate a query plan that joins A to B, using
the <literal>WHERE</> condition <literal>a.id = b.id</>, and then
joins C to this joined table, using the other <literal>WHERE</>
condition. Or it could join B to C and then join A to that result.
Or it could join A to C and then join them with B, but that
would be inefficient, since the full Cartesian product of A and C
would have to be formed, there being no applicable condition in the
<literal>WHERE</> clause to allow optimization of the join. (All
joins in the <productname>PostgreSQL</productname> executor happen
between two input tables, so it's necessary to build up the result
in one or another of these fashions.) The important point is that
these different join possibilities give semantically equivalent
results but may have hugely different execution costs. Therefore,
the planner will explore all of them to try to find the most
efficient query plan.
</para>

<para>
When a query only involves two or three tables, there aren't many join
orders to worry about. But the number of possible join orders grows
exponentially as the number of tables expands. Beyond ten or so input
tables it's no longer practical to do an exhaustive search of all the
possibilities, and even for six or seven tables planning may take an
annoyingly long time. When there are too many input tables, the
<productname>PostgreSQL</productname> planner will switch from exhaustive
search to a <firstterm>genetic</firstterm> probabilistic search
through a limited number of possibilities. (The switch-over threshold is
set by the <xref linkend="guc-geqo-threshold"> run-time
parameter.)
The genetic search takes less time, but it won't
necessarily find the best possible plan.
</para>
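
<para>
For illustration, the threshold can be inspected and changed for the
current session. The value 14 below is arbitrary and is not a
recommendation:
<programlisting>
-- Show the current switch-over threshold.
SHOW geqo_threshold;

-- Use exhaustive search for queries with fewer FROM items than this.
SET geqo_threshold = 14;
</programlisting>
</para>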

<para>
When the query involves outer joins, the planner has much less freedom
than it does for plain (inner) joins. For example, consider
<programlisting>
SELECT * FROM a LEFT JOIN (b JOIN c ON (b.ref = c.id)) ON (a.id = b.id);
</programlisting>
Although this query's restrictions are superficially similar to the
previous example, the semantics are different because a row must be
emitted for each row of A that has no matching row in the join of B and C.
Therefore the planner has no choice of join order here: it must join
B to C and then join A to that result. Accordingly, this query takes
less time to plan than the previous query.
</para>

<para>
Explicit inner join syntax (<literal>INNER JOIN</>, <literal>CROSS
JOIN</>, or unadorned <literal>JOIN</>) is semantically the same as
listing the input relations in <literal>FROM</>, so it does not need to
constrain the join order. But it is possible to instruct the
<productname>PostgreSQL</productname> query planner to treat
explicit inner <literal>JOIN</>s as constraining the join order anyway.
For example, these three queries are logically equivalent:
<programlisting>
SELECT * FROM a, b, c WHERE a.id = b.id AND b.ref = c.id;
SELECT * FROM a CROSS JOIN b CROSS JOIN c WHERE a.id = b.id AND b.ref = c.id;
SELECT * FROM a JOIN (b JOIN c ON (b.ref = c.id)) ON (a.id = b.id);
</programlisting>
But if we tell the planner to honor the <literal>JOIN</> order,
the second and third take less time to plan than the first. This effect
is not worth worrying about for only three tables, but it can be a
lifesaver with many tables.
</para>

<para>
To force the planner to follow the <literal>JOIN</> order for inner joins,
set the <xref linkend="guc-join-collapse-limit"> run-time parameter to 1.
(Other possible values are discussed below.)
</para>
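
<para>
A minimal sketch of this setting in use, assuming the example tables
<literal>a</>, <literal>b</>, and <literal>c</> exist with the columns
shown in the queries above:
<programlisting>
-- Make the planner follow the written JOIN order in this session.
SET join_collapse_limit = 1;

-- B is joined to C first, then A is joined to that result.
EXPLAIN SELECT * FROM a JOIN (b JOIN c ON (b.ref = c.id)) ON (a.id = b.id);

-- Return to the configured default afterwards.
RESET join_collapse_limit;
</programlisting>
</para>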

<para>
You do not need to constrain the join order completely in order to
cut search time, because it's OK to use <literal>JOIN</> operators
within items of a plain <literal>FROM</> list. For example, consider
<programlisting>
SELECT * FROM a CROSS JOIN b, c, d, e WHERE ...;
</programlisting>
With <varname>join_collapse_limit</> = 1, this
forces the planner to join A to B before joining them to other tables,
but doesn't constrain its choices otherwise. In this example, the
number of possible join orders is reduced by a factor of 5.
</para>

<para>
Constraining the planner's search in this way is a useful technique
both for reducing planning time and for directing the planner to a
good query plan. If the planner chooses a bad join order by default,
you can force it to choose a better order via <literal>JOIN</> syntax,
assuming that you know of a better order, that is. Experimentation
is recommended.
</para>

<para>
A closely related issue that affects planning time is collapsing of
subqueries into their parent query. For example, consider
<programlisting>
SELECT *
FROM x, y,
(SELECT * FROM a, b, c WHERE something) AS ss
WHERE somethingelse;
</programlisting>
This situation might arise from use of a view that contains a join;
the view's <literal>SELECT</> rule will be inserted in place of the view reference,
yielding a query much like the above. Normally, the planner will try
to collapse the subquery into the parent, yielding
<programlisting>
SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
</programlisting>
This usually results in a better plan than planning the subquery
separately. (For example, the outer <literal>WHERE</> conditions might be such that
joining X to A first eliminates many rows of A, thus avoiding the need to
form the full logical output of the subquery.) But at the same time,
we have increased the planning time; here, we have a five-way join
problem replacing two separate three-way join problems. Because of the
exponential growth of the number of possibilities, this makes a big
difference. The planner tries to avoid getting stuck in huge join search
problems by not collapsing a subquery if more than <varname>from_collapse_limit</>
<literal>FROM</> items would result in the parent
query. You can trade off planning time against quality of plan by
adjusting this run-time parameter up or down.
</para>

<para>
<xref linkend="guc-from-collapse-limit"> and <xref
linkend="guc-join-collapse-limit">
are similarly named because they do almost the same thing: one controls
when the planner will <quote>flatten out</> subselects, and the
other controls when it will flatten out explicit inner joins. Typically
you would either set <varname>join_collapse_limit</> equal to
<varname>from_collapse_limit</> (so that explicit joins and subselects
act similarly) or set <varname>join_collapse_limit</> to 1 (if you want
to control join order with explicit joins). But you might set them
differently if you are trying to fine-tune the trade-off between planning
time and run time.
</para>
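
<para>
The two typical configurations can be sketched as follows. The value 8
is only illustrative, and the two groups of statements are alternatives,
not steps to run together:
<programlisting>
-- Alternative 1: explicit joins and subselects are flattened alike.
SET from_collapse_limit = 8;
SET join_collapse_limit = 8;

-- Alternative 2: subselects are still flattened, but explicit
-- JOIN syntax dictates the join order.
SET from_collapse_limit = 8;
SET join_collapse_limit = 1;
</programlisting>
</para>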

</sect1>

<title>Populating a Database</title>

<para>
One may need to insert a large amount of data when first populating
a database. This section contains some suggestions on how to make
this process as efficient as possible.
</para>


<title>Disable Autocommit</title>

<primary>autocommit</primary>
<secondary>bulk-loading data</secondary>
</indexterm>

<para>
Turn off autocommit and just do one commit at the end. (In plain
SQL, this means issuing <command>BEGIN</command> at the start and
<command>COMMIT</command> at the end. Some client libraries may
do this behind your back, in which case you need to make sure the
library does it when you want it done.) If you allow each
insertion to be committed separately,
<productname>PostgreSQL</productname> is doing a lot of work for
each row that is added. An additional benefit of doing all
insertions in one transaction is that if the insertion of one row
were to fail then the insertion of all rows inserted up to that
point would be rolled back, so you won't be stuck with partially
loaded data.
</para>
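
<para>
In plain SQL the pattern looks roughly like this. The table
<literal>mytable</> and its two columns are hypothetical and serve only
to illustrate the transaction boundaries:
<programlisting>
BEGIN;
INSERT INTO mytable VALUES (1, 'one');
INSERT INTO mytable VALUES (2, 'two');
-- ... many more INSERT commands ...
COMMIT;  -- one commit for the whole load
</programlisting>
</para>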

</sect2>


<para>
Use <xref linkend="sql-copy" endterm="sql-copy-title"> to load
all the rows in one command, instead of using a series of
<command>INSERT</command> commands. The <command>COPY</command>
command is optimized for loading large numbers of rows; it is less
flexible than <command>INSERT</command>, but incurs significantly
less overhead for large data loads. Since <command>COPY</command>
is a single command, there is no need to disable autocommit if you
use this method to populate a table.
</para>
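
<para>
A minimal sketch, assuming a hypothetical table <literal>mytable</> and
a data file in the default text format at a placeholder path that the
server process can read:
<programlisting>
-- Load every row of the file with a single command.
COPY mytable FROM '/path/to/data.txt';
</programlisting>
When the file resides on the client machine instead, the
<command>\copy</command> command of <application>psql</application>
may be an alternative.
</para>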

<para>
If you cannot use <command>COPY</command>, it may help to use <xref
linkend="sql-prepare" endterm="sql-prepare-title"> to create a
prepared <command>INSERT</command> statement, and then use
<command>EXECUTE</command> as many times as required. This avoids
some of the overhead of repeatedly parsing and planning
<command>INSERT</command>.
</para>
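
<para>
For example, with a hypothetical two-column table <literal>mytable</>
and a made-up statement name <literal>ins</>, the approach might look
like this:
<programlisting>
-- Parse and plan the INSERT once.
PREPARE ins (integer, text) AS
    INSERT INTO mytable VALUES ($1, $2);

-- Execute it once per row, inside a single transaction.
BEGIN;
EXECUTE ins(1, 'one');
EXECUTE ins(2, 'two');
COMMIT;

-- Release the prepared statement when finished.
DEALLOCATE ins;
</programlisting>
</para>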

<para>
Note that loading a large number of rows using
<command>COPY</command> is almost always faster than using
<command>INSERT</command>, even if <command>PREPARE</> is used and
multiple insertions are batched into a single transaction.
</para>

</sect2>

<title>Remove Indexes</title>

<para>
If you are loading a freshly created table, the fastest way is to
create the table, bulk load the table's data using
<command>COPY</command>, then create any indexes needed for the
table. Creating an index on pre-existing data is quicker than
updating it incrementally as each row is loaded.
</para>

<para>
If you are augmenting an existing table, you can drop the index,
load the table, and then recreate the index. Of course, the
database performance for other users may be adversely affected
during the time that the index is missing. One should also think
twice before dropping unique indexes, since the error checking
afforded by the unique constraint will be lost while the index is
missing.
</para>
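
<para>
The sequence for an existing table can be sketched as follows. The
table, column, index name, and file path are all hypothetical:
<programlisting>
-- Drop the index so it is not updated row by row during the load.
DROP INDEX mytable_mycol_idx;

-- Bulk load the new rows.
COPY mytable FROM '/path/to/data.txt';

-- Rebuild the index once, over all of the data.
CREATE INDEX mytable_mycol_idx ON mytable (mycol);
</programlisting>
</para>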

</sect2>

<title>Increase <varname>maintenance_work_mem</varname></title>

<para>
Temporarily increasing the <xref linkend="guc-maintenance-work-mem">
configuration variable when loading large amounts of data can
lead to improved performance. This is because when a B-tree index
is created from scratch, the existing content of the table needs
to be sorted. Allowing the merge sort to use more memory
means that fewer merge passes will be required. A larger setting for
<varname>maintenance_work_mem</varname> may also speed up validation
of foreign-key constraints.
</para>
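
<para>
As a rough sketch, the setting can be raised for just the session that
builds the index. The value is given in kilobytes and is only
illustrative, and the table and index names are hypothetical:
<programlisting>
-- About 256 MB, expressed in kilobytes.
SET maintenance_work_mem = 262144;

CREATE INDEX mytable_mycol_idx ON mytable (mycol);

-- Return to the configured default.
RESET maintenance_work_mem;
</programlisting>
</para>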

</sect2>

<title>Increase <varname>checkpoint_segments</varname></title>

<para>
Temporarily increasing the <xref
linkend="guc-checkpoint-segments"> configuration variable can also
make large data loads faster. This is because loading a large
amount of data into <productname>PostgreSQL</productname> can
cause checkpoints to occur more often than the normal checkpoint
frequency (specified by the <varname>checkpoint_timeout</varname>
configuration variable). Whenever a checkpoint occurs, all dirty
pages must be flushed to disk. By increasing
<varname>checkpoint_segments</varname> temporarily during bulk
data loads, the number of checkpoints that are required can be
reduced.
</para>
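
<para>
Assuming the parameter is set in the server configuration file rather
than per session, a temporary change might look like the following
fragment of <filename>postgresql.conf</filename>. The value 30 is
illustrative only, and the server must re-read its configuration for
the change to take effect:
<programlisting>
# Raised for the duration of the bulk load; restore afterwards.
checkpoint_segments = 30
</programlisting>
</para>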

</sect2>

<title>Run <command>ANALYZE</command> Afterwards</title>

<para>
Whenever you have significantly altered the distribution of data
within a table, running <xref linkend="sql-analyze"
endterm="sql-analyze-title"> is strongly recommended. This
includes bulk loading large amounts of data into the table. Running
<command>ANALYZE</command> (or <command>VACUUM ANALYZE</command>)
ensures that the planner has up-to-date statistics about the
table. With no statistics or obsolete statistics, the planner may
make poor decisions during query planning, leading to poor
performance on any tables with inaccurate or nonexistent
statistics.
</para>
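
<para>
For a hypothetical table <literal>mytable</> that has just been loaded,
either of these commands refreshes its statistics:
<programlisting>
-- Gather fresh planner statistics for the table.
ANALYZE mytable;

-- Or vacuum the table and gather statistics in one step.
VACUUM ANALYZE mytable;
</programlisting>
</para>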

</sect2>
</sect1>

</chapter>

<!-- Keep this comment at the end of the file
Local variables:
mode:sgml
sgml-omittag:nil
sgml-shorttag:t
sgml-minimize-attributes:nil
sgml-always-quote-attributes:t
sgml-indent-step:1
sgml-indent-data:t
sgml-parent-document:nil
sgml-default-dtd-file:"./reference.ced"
sgml-exposed-tags:nil
sgml-local-catalogs:("/usr/lib/sgml/catalog")
sgml-local-ecat-files:nil
End:
-->
