~lifeless/ubuntu/lucid/bzr/2.1.2-sru

« back to all changes in this revision

Viewing changes to doc/developers/improved_chk_index.txt

Committer: Bazaar Package Importer
Author(s): Jelmer Vernooij
Date: 2009-06-27 15:23:34 UTC
mfrom: (1.3.1 upstream) (3.1.78 karmic)
Revision ID: james.westby@ubuntu.com-20090627152334-u3smexjpaolh96qd

* New upstream release.
* Bump standards version to 3.8.2.

files added:
bzrlib/_bencode_pyx.c

bzrlib/_bencode_pyx.h

bzrlib/_bencode_pyx.pyx

bzrlib/_chk_map_py.py

bzrlib/_chk_map_pyx.c

bzrlib/_chk_map_pyx.pyx

bzrlib/_groupcompress_py.py

bzrlib/_groupcompress_pyx.c

bzrlib/_groupcompress_pyx.pyx

bzrlib/_rio_py.py

bzrlib/_rio_pyx.c

bzrlib/_rio_pyx.pyx

bzrlib/benchmarks/bench_tags.py

bzrlib/bencode.py

bzrlib/chk_map.py

bzrlib/chk_serializer.py

bzrlib/delta.h

bzrlib/diff-delta.c

bzrlib/filters

bzrlib/filters/__init__.py

bzrlib/filters/eol.py

bzrlib/groupcompress.py

bzrlib/help_topics/en/content-filters.txt

bzrlib/help_topics/en/debug-flags.txt

bzrlib/help_topics/en/eol.txt

bzrlib/inventory_delta.py

bzrlib/rename_map.py

bzrlib/repofmt/groupcompress_repo.py

bzrlib/send.py

bzrlib/serializer.py

bzrlib/tests/blackbox/test_dpush.py

bzrlib/tests/blackbox/test_reference.py

bzrlib/tests/bzrdir_implementations/test_push.py

bzrlib/tests/ftp_server

bzrlib/tests/ftp_server/__init__.py

bzrlib/tests/ftp_server/medusa_based.py

bzrlib/tests/ftp_server/pyftpdlib_based.py

bzrlib/tests/per_interbranch/test_pull.py

bzrlib/tests/per_interbranch/test_push.py

bzrlib/tests/per_repository/test_refresh_data.py

bzrlib/tests/per_repository_chk

bzrlib/tests/per_repository_chk/__init__.py

bzrlib/tests/per_repository_chk/test_supported.py

bzrlib/tests/per_repository_chk/test_unsupported.py

bzrlib/tests/per_repository_reference/test_fetch.py

bzrlib/tests/per_repository_reference/test_initialize.py

bzrlib/tests/per_repository_reference/test_unlock.py

bzrlib/tests/test__chk_map.py

bzrlib/tests/test__groupcompress.py

bzrlib/tests/test__rio.py

bzrlib/tests/test_bencode.py

bzrlib/tests/test_chk_map.py

bzrlib/tests/test_chk_serializer.py

bzrlib/tests/test_eol_filters.py

bzrlib/tests/test_filters.py

bzrlib/tests/test_groupcompress.py

bzrlib/tests/test_inventory_delta.py

bzrlib/tests/test_rename_map.py

bzrlib/tests/test_serializer.py

bzrlib/tests/tree_implementations/test_get_file_with_stat.py

bzrlib/tests/workingtree_implementations/test_content_filters.py

bzrlib/tests/workingtree_implementations/test_eol_conversion.py

bzrlib/util/_bencode_py.py

doc/developers/ec2.txt

doc/developers/groupcompress-design.txt

doc/developers/improved_chk_index.txt

doc/en/developer-guide/HACKING.txt

doc/en/user-guide/shelving_changes.txt

doc/index.ru.txt

doc/ru

doc/ru/mini-tutorial

doc/ru/mini-tutorial/index.txt

doc/ru/quick-reference

doc/ru/quick-reference/Makefile

doc/ru/quick-reference/quick-start-summary.pdf

doc/ru/quick-reference/quick-start-summary.png

doc/ru/quick-reference/quick-start-summary.svg

doc/ru/tutorials

doc/ru/tutorials/centralized_workflow.txt

doc/ru/tutorials/tutorial.txt

doc/ru/tutorials/using_bazaar_with_launchpad.txt

doc/ru/user-guide

doc/ru/user-guide/branching_a_project.txt

doc/ru/user-guide/core_concepts.txt

doc/ru/user-guide/images

doc/ru/user-guide/images/workflows_centralized.png

doc/ru/user-guide/images/workflows_centralized.svg

doc/ru/user-guide/images/workflows_gatekeeper.png

doc/ru/user-guide/images/workflows_gatekeeper.svg

doc/ru/user-guide/images/workflows_localcommit.png

doc/ru/user-guide/images/workflows_localcommit.svg

doc/ru/user-guide/images/workflows_peer.png

doc/ru/user-guide/images/workflows_peer.svg

doc/ru/user-guide/images/workflows_pqm.png

doc/ru/user-guide/images/workflows_pqm.svg

doc/ru/user-guide/images/workflows_shared.png

doc/ru/user-guide/images/workflows_shared.svg

doc/ru/user-guide/images/workflows_single.png

doc/ru/user-guide/images/workflows_single.svg

doc/ru/user-guide/index.txt

doc/ru/user-guide/introducing_bazaar.txt

doc/ru/user-guide/specifying_revisions.txt

doc/ru/user-guide/stacked.txt

doc/ru/user-guide/using_checkouts.txt

doc/ru/user-guide/zen.txt

tools/check-newsbugs.py

files removed:
bzrlib/help_topics/en/hooks.txt

bzrlib/tests/ftp_server.py

bzrlib/tests/test_http_implementations.py

bzrlib/tests/workingtree_implementations/test_get_file_with_stat.py

bzrlib/util/configobj/docs

bzrlib/util/configobj/docs/BSD-LICENSE.txt

bzrlib/util/configobj/docs/configobj.txt

bzrlib/util/configobj/docs/validate.txt

bzrlib/util/tests/test_bencode.py

debian/patches

debian/patches/series

doc/developers/HACKING.txt

doc/developers/ec2-windows.txt

doc/developers/performance-contributing.txt

files modified:
Makefile

NEWS

bzrlib/__init__.py

bzrlib/_btree_serializer_c.c

bzrlib/_btree_serializer_c.pyx

bzrlib/_btree_serializer_py.py

bzrlib/_chunks_to_lines_py.py

bzrlib/_chunks_to_lines_pyx.c

bzrlib/_chunks_to_lines_pyx.pyx

bzrlib/_dirstate_helpers_c.c

bzrlib/_dirstate_helpers_c.pyx

bzrlib/_dirstate_helpers_py.py

bzrlib/_knit_load_data_c.c

bzrlib/_knit_load_data_c.pyx

bzrlib/_knit_load_data_py.py

bzrlib/_patiencediff_c.c

bzrlib/_patiencediff_py.py

bzrlib/_readdir_py.py

bzrlib/_readdir_pyx.c

bzrlib/_readdir_pyx.pyx

bzrlib/_walkdirs_win32.pyx

bzrlib/add.py

bzrlib/annotate.py

bzrlib/api.py

bzrlib/atomicfile.py

bzrlib/benchmarks/__init__.py

bzrlib/benchmarks/bench_add.py

bzrlib/benchmarks/bench_bench.py

bzrlib/benchmarks/bench_bundle.py

bzrlib/benchmarks/bench_cache_utf8.py

bzrlib/benchmarks/bench_checkout.py

bzrlib/benchmarks/bench_commit.py

bzrlib/benchmarks/bench_dirstate.py

bzrlib/benchmarks/bench_info.py

bzrlib/benchmarks/bench_inventory.py

bzrlib/benchmarks/bench_knit.py

bzrlib/benchmarks/bench_log.py

bzrlib/benchmarks/bench_osutils.py

bzrlib/benchmarks/bench_pack.py

bzrlib/benchmarks/bench_rocks.py

bzrlib/benchmarks/bench_sftp.py

bzrlib/benchmarks/bench_startup.py

bzrlib/benchmarks/bench_status.py

bzrlib/benchmarks/bench_transform.py

bzrlib/benchmarks/bench_workingtree.py

bzrlib/benchmarks/bench_xml.py

bzrlib/benchmarks/tree_creator/__init__.py

bzrlib/benchmarks/tree_creator/heavily_merged.py

bzrlib/benchmarks/tree_creator/kernel_like.py

bzrlib/benchmarks/tree_creator/many_commit.py

bzrlib/benchmarks/tree_creator/simple_many_commit.py

bzrlib/bisect_multi.py

bzrlib/branch.py

bzrlib/branchbuilder.py

bzrlib/breakin.py

bzrlib/btree_index.py

bzrlib/bugtracker.py

bzrlib/builtins.py

bzrlib/bundle/__init__.py

bzrlib/bundle/apply_bundle.py

bzrlib/bundle/bundle_data.py

bzrlib/bundle/commands.py

bzrlib/bundle/serializer/__init__.py

bzrlib/bundle/serializer/v08.py

bzrlib/bundle/serializer/v09.py

bzrlib/bundle/serializer/v4.py

bzrlib/bzrdir.py

bzrlib/cache_utf8.py

bzrlib/check.py

bzrlib/chunk_writer.py

bzrlib/clean_tree.py

bzrlib/cmd_version_info.py

bzrlib/commands.py

bzrlib/commit.py

bzrlib/config.py

bzrlib/conflicts.py

bzrlib/counted_lock.py

bzrlib/debug.py

bzrlib/decorators.py

bzrlib/delta.py

bzrlib/deprecated_graph.py

bzrlib/diff.py

bzrlib/directory_service.py

bzrlib/dirstate.py

bzrlib/doc/__init__.py

bzrlib/doc/api/__init__.py

bzrlib/email_message.py

bzrlib/errors.py

bzrlib/export/__init__.py

bzrlib/export/dir_exporter.py

bzrlib/export/tar_exporter.py

bzrlib/export/zip_exporter.py

bzrlib/externalcommand.py

bzrlib/fetch.py

bzrlib/fifo_cache.py

bzrlib/foreign.py

bzrlib/generate_ids.py

bzrlib/globbing.py

bzrlib/gpg.py

bzrlib/graph.py

bzrlib/hashcache.py

bzrlib/help.py

bzrlib/help_topics/__init__.py

bzrlib/help_topics/en/rules.txt

bzrlib/hooks.py

bzrlib/identitymap.py

bzrlib/ignores.py

bzrlib/index.py

bzrlib/info.py

bzrlib/inspect_for_copy.py

bzrlib/inter.py

bzrlib/intset.py

bzrlib/inventory.py

bzrlib/iterablefile.py

bzrlib/knit.py

bzrlib/lazy_import.py

bzrlib/lazy_regex.py

bzrlib/lock.py

bzrlib/lockable_files.py

bzrlib/lockdir.py

bzrlib/log.py

bzrlib/lru_cache.py

bzrlib/mail_client.py

bzrlib/memorytree.py

bzrlib/merge.py

bzrlib/merge3.py

bzrlib/merge_directive.py

bzrlib/missing.py

bzrlib/msgeditor.py

bzrlib/multiparent.py

bzrlib/mutabletree.py

bzrlib/option.py

bzrlib/osutils.py

bzrlib/pack.py

bzrlib/patch.py

bzrlib/patches.py

bzrlib/patiencediff.py

bzrlib/plugin.py

bzrlib/plugins/__init__.py

bzrlib/plugins/launchpad/__init__.py

bzrlib/plugins/launchpad/account.py

bzrlib/plugins/launchpad/lp_directory.py

bzrlib/plugins/launchpad/lp_registration.py

bzrlib/plugins/launchpad/test_account.py

bzrlib/plugins/launchpad/test_lp_directory.py

bzrlib/plugins/launchpad/test_lp_open.py

bzrlib/plugins/launchpad/test_lp_service.py

bzrlib/plugins/launchpad/test_register.py

bzrlib/plugins/netrc_credential_store/__init__.py

bzrlib/plugins/netrc_credential_store/tests/__init__.py

bzrlib/plugins/netrc_credential_store/tests/test_netrc.py

bzrlib/progress.py

bzrlib/push.py

bzrlib/python-compat.h

bzrlib/readdir.h

bzrlib/reconcile.py

bzrlib/reconfigure.py

bzrlib/registry.py

bzrlib/remote.py

bzrlib/repofmt/__init__.py

bzrlib/repofmt/knitrepo.py

bzrlib/repofmt/pack_repo.py

bzrlib/repofmt/weaverepo.py

bzrlib/repository.py

bzrlib/revision.py

bzrlib/revisionspec.py

bzrlib/revisiontree.py

bzrlib/rio.py

bzrlib/rules.py

bzrlib/shelf.py

bzrlib/shelf_ui.py

bzrlib/shellcomplete.py

bzrlib/sign_my_commits.py

bzrlib/smart/__init__.py

bzrlib/smart/branch.py

bzrlib/smart/bzrdir.py

bzrlib/smart/client.py

bzrlib/smart/medium.py

bzrlib/smart/message.py

bzrlib/smart/packrepository.py

bzrlib/smart/protocol.py

bzrlib/smart/repository.py

bzrlib/smart/request.py

bzrlib/smart/server.py

bzrlib/smart/vfs.py

bzrlib/smtp_connection.py

bzrlib/status.py

bzrlib/store/__init__.py

bzrlib/store/text.py

bzrlib/store/versioned/__init__.py

bzrlib/strace.py

bzrlib/switch.py

bzrlib/symbol_versioning.py

bzrlib/tag.py

bzrlib/testament.py

bzrlib/tests/EncodingAdapter.py

bzrlib/tests/TestUtil.py

bzrlib/tests/__init__.py

bzrlib/tests/blackbox/__init__.py

bzrlib/tests/blackbox/test_add.py

bzrlib/tests/blackbox/test_added.py

bzrlib/tests/blackbox/test_alias.py

bzrlib/tests/blackbox/test_aliases.py

bzrlib/tests/blackbox/test_ancestry.py

bzrlib/tests/blackbox/test_annotate.py

bzrlib/tests/blackbox/test_bound_branches.py

bzrlib/tests/blackbox/test_branch.py

bzrlib/tests/blackbox/test_break_lock.py

bzrlib/tests/blackbox/test_breakin.py

bzrlib/tests/blackbox/test_bundle_info.py

bzrlib/tests/blackbox/test_cat.py

bzrlib/tests/blackbox/test_cat_revision.py

bzrlib/tests/blackbox/test_check.py

bzrlib/tests/blackbox/test_checkout.py

bzrlib/tests/blackbox/test_clean_tree.py

bzrlib/tests/blackbox/test_command_encoding.py

bzrlib/tests/blackbox/test_commit.py

bzrlib/tests/blackbox/test_conflicts.py

bzrlib/tests/blackbox/test_debug.py

bzrlib/tests/blackbox/test_diff.py

bzrlib/tests/blackbox/test_dump_btree.py

bzrlib/tests/blackbox/test_exceptions.py

bzrlib/tests/blackbox/test_export.py

bzrlib/tests/blackbox/test_filesystem_cicp.py

bzrlib/tests/blackbox/test_filtered_view_ops.py

bzrlib/tests/blackbox/test_find_merge_base.py

bzrlib/tests/blackbox/test_help.py

bzrlib/tests/blackbox/test_hooks.py

bzrlib/tests/blackbox/test_ignore.py

bzrlib/tests/blackbox/test_ignored.py

bzrlib/tests/blackbox/test_info.py

bzrlib/tests/blackbox/test_init.py

bzrlib/tests/blackbox/test_inventory.py

bzrlib/tests/blackbox/test_join.py

bzrlib/tests/blackbox/test_locale.py

bzrlib/tests/blackbox/test_log.py

bzrlib/tests/blackbox/test_logformats.py

bzrlib/tests/blackbox/test_ls.py

bzrlib/tests/blackbox/test_lsprof.py

bzrlib/tests/blackbox/test_merge.py

bzrlib/tests/blackbox/test_merge_directive.py

bzrlib/tests/blackbox/test_missing.py

bzrlib/tests/blackbox/test_modified.py

bzrlib/tests/blackbox/test_mv.py

bzrlib/tests/blackbox/test_nick.py

bzrlib/tests/blackbox/test_non_ascii.py

bzrlib/tests/blackbox/test_outside_wt.py

bzrlib/tests/blackbox/test_pack.py

bzrlib/tests/blackbox/test_pull.py

bzrlib/tests/blackbox/test_push.py

bzrlib/tests/blackbox/test_re_sign.py

bzrlib/tests/blackbox/test_reconcile.py

bzrlib/tests/blackbox/test_reconfigure.py

bzrlib/tests/blackbox/test_remerge.py

bzrlib/tests/blackbox/test_remove.py

bzrlib/tests/blackbox/test_remove_tree.py

bzrlib/tests/blackbox/test_revert.py

bzrlib/tests/blackbox/test_revision_history.py

bzrlib/tests/blackbox/test_revision_info.py

bzrlib/tests/blackbox/test_revno.py

bzrlib/tests/blackbox/test_selftest.py

bzrlib/tests/blackbox/test_send.py

bzrlib/tests/blackbox/test_serve.py

bzrlib/tests/blackbox/test_shared_repository.py

bzrlib/tests/blackbox/test_shelve.py

bzrlib/tests/blackbox/test_sign_my_commits.py

bzrlib/tests/blackbox/test_split.py

bzrlib/tests/blackbox/test_status.py

bzrlib/tests/blackbox/test_switch.py

bzrlib/tests/blackbox/test_tags.py

bzrlib/tests/blackbox/test_testament.py

bzrlib/tests/blackbox/test_too_much.py

bzrlib/tests/blackbox/test_uncommit.py

bzrlib/tests/blackbox/test_unknowns.py

bzrlib/tests/blackbox/test_update.py

bzrlib/tests/blackbox/test_upgrade.py

bzrlib/tests/blackbox/test_version.py

bzrlib/tests/blackbox/test_version_info.py

bzrlib/tests/blackbox/test_versioning.py

bzrlib/tests/blackbox/test_view.py

bzrlib/tests/blackbox/test_whoami.py

bzrlib/tests/branch_implementations/__init__.py

bzrlib/tests/branch_implementations/test_bound_sftp.py

bzrlib/tests/branch_implementations/test_branch.py

bzrlib/tests/branch_implementations/test_break_lock.py

bzrlib/tests/branch_implementations/test_check.py

bzrlib/tests/branch_implementations/test_commit.py

bzrlib/tests/branch_implementations/test_create_checkout.py

bzrlib/tests/branch_implementations/test_create_clone.py

bzrlib/tests/branch_implementations/test_dotted_revno_to_revision_id.py

bzrlib/tests/branch_implementations/test_get_revision_id_to_revno_map.py

bzrlib/tests/branch_implementations/test_hooks.py

bzrlib/tests/branch_implementations/test_http.py

bzrlib/tests/branch_implementations/test_iter_merge_sorted_revisions.py

bzrlib/tests/branch_implementations/test_last_revision_info.py

bzrlib/tests/branch_implementations/test_locking.py

bzrlib/tests/branch_implementations/test_parent.py

bzrlib/tests/branch_implementations/test_permissions.py

bzrlib/tests/branch_implementations/test_pull.py

bzrlib/tests/branch_implementations/test_push.py

bzrlib/tests/branch_implementations/test_reconcile.py

bzrlib/tests/branch_implementations/test_revision_history.py

bzrlib/tests/branch_implementations/test_revision_id_to_dotted_revno.py

bzrlib/tests/branch_implementations/test_revision_id_to_revno.py

bzrlib/tests/branch_implementations/test_sprout.py

bzrlib/tests/branch_implementations/test_stacking.py

bzrlib/tests/branch_implementations/test_tags.py

bzrlib/tests/branch_implementations/test_uncommit.py

bzrlib/tests/branch_implementations/test_update.py

bzrlib/tests/bzrdir_implementations/__init__.py

bzrlib/tests/bzrdir_implementations/test_bzrdir.py

bzrlib/tests/commands/__init__.py

bzrlib/tests/commands/test_branch.py

bzrlib/tests/commands/test_cat.py

bzrlib/tests/commands/test_checkout.py

bzrlib/tests/commands/test_commit.py

bzrlib/tests/commands/test_init.py

bzrlib/tests/commands/test_init_repository.py

bzrlib/tests/commands/test_merge.py

bzrlib/tests/commands/test_missing.py

bzrlib/tests/commands/test_pull.py

bzrlib/tests/commands/test_push.py

bzrlib/tests/commands/test_update.py

bzrlib/tests/fake_command.py

bzrlib/tests/file_utils.py

bzrlib/tests/http_server.py

bzrlib/tests/http_utils.py

bzrlib/tests/https_server.py

bzrlib/tests/interrepository_implementations/__init__.py

bzrlib/tests/interrepository_implementations/test_fetch.py

bzrlib/tests/interrepository_implementations/test_interrepository.py

bzrlib/tests/intertree_implementations/__init__.py

bzrlib/tests/intertree_implementations/test_compare.py

bzrlib/tests/inventory_implementations/__init__.py

bzrlib/tests/inventory_implementations/basics.py

bzrlib/tests/lock_helpers.py

bzrlib/tests/per_interbranch/__init__.py

bzrlib/tests/per_interbranch/test_update_revisions.py

bzrlib/tests/per_lock/__init__.py

bzrlib/tests/per_lock/test_lock.py

bzrlib/tests/per_lock/test_temporary_write_lock.py

bzrlib/tests/per_repository/__init__.py

bzrlib/tests/per_repository/helpers.py

bzrlib/tests/per_repository/test__generate_text_key_index.py

bzrlib/tests/per_repository/test_add_fallback_repository.py

bzrlib/tests/per_repository/test_add_inventory_by_delta.py

bzrlib/tests/per_repository/test_break_lock.py

bzrlib/tests/per_repository/test_check.py

bzrlib/tests/per_repository/test_check_reconcile.py

bzrlib/tests/per_repository/test_commit_builder.py

bzrlib/tests/per_repository/test_fetch.py

bzrlib/tests/per_repository/test_fileid_involved.py

bzrlib/tests/per_repository/test_find_text_key_references.py

bzrlib/tests/per_repository/test_get_parent_map.py

bzrlib/tests/per_repository/test_has_revisions.py

bzrlib/tests/per_repository/test_has_same_location.py

bzrlib/tests/per_repository/test_is_write_locked.py

bzrlib/tests/per_repository/test_iter_reverse_revision_history.py

bzrlib/tests/per_repository/test_pack.py

bzrlib/tests/per_repository/test_reconcile.py

bzrlib/tests/per_repository/test_repository.py

bzrlib/tests/per_repository/test_revision.py

bzrlib/tests/per_repository/test_statistics.py

bzrlib/tests/per_repository/test_write_group.py

bzrlib/tests/per_repository_reference/__init__.py

bzrlib/tests/per_repository_reference/test_add_inventory.py

bzrlib/tests/per_repository_reference/test_add_revision.py

bzrlib/tests/per_repository_reference/test_add_signature_text.py

bzrlib/tests/per_repository_reference/test_all_revision_ids.py

bzrlib/tests/per_repository_reference/test_break_lock.py

bzrlib/tests/per_repository_reference/test_check.py

bzrlib/tests/per_repository_reference/test_default_stacking.py

bzrlib/tests/ssl_certs/__init__.py

bzrlib/tests/ssl_certs/create_ssls.py

bzrlib/tests/stub_sftp.py

bzrlib/tests/test__chunks_to_lines.py

bzrlib/tests/test__dirstate_helpers.py

bzrlib/tests/test__walkdirs_win32.py

bzrlib/tests/test_ancestry.py

bzrlib/tests/test_annotate.py

bzrlib/tests/test_api.py

bzrlib/tests/test_atomicfile.py

bzrlib/tests/test_bad_files.py

bzrlib/tests/test_bisect_multi.py

bzrlib/tests/test_branch.py

bzrlib/tests/test_branchbuilder.py

bzrlib/tests/test_btree_index.py

bzrlib/tests/test_bugtracker.py

bzrlib/tests/test_bundle.py

bzrlib/tests/test_bzrdir.py

bzrlib/tests/test_cache_utf8.py

bzrlib/tests/test_chunk_writer.py

bzrlib/tests/test_clean_tree.py

bzrlib/tests/test_commands.py

bzrlib/tests/test_commit.py

bzrlib/tests/test_commit_merge.py

bzrlib/tests/test_config.py

bzrlib/tests/test_conflicts.py

bzrlib/tests/test_counted_lock.py

bzrlib/tests/test_debug.py

bzrlib/tests/test_decorators.py

bzrlib/tests/test_delta.py

bzrlib/tests/test_deprecated_graph.py

bzrlib/tests/test_diff.py

bzrlib/tests/test_directory_service.py

bzrlib/tests/test_dirstate.py

bzrlib/tests/test_email_message.py

bzrlib/tests/test_errors.py

bzrlib/tests/test_export.py

bzrlib/tests/test_extract.py

bzrlib/tests/test_fetch.py

bzrlib/tests/test_fifo_cache.py

bzrlib/tests/test_foreign.py

bzrlib/tests/test_ftp_transport.py

bzrlib/tests/test_generate_docs.py

bzrlib/tests/test_generate_ids.py

bzrlib/tests/test_globbing.py

bzrlib/tests/test_gpg.py

bzrlib/tests/test_graph.py

bzrlib/tests/test_hashcache.py

bzrlib/tests/test_help.py

bzrlib/tests/test_hooks.py

bzrlib/tests/test_http.py

bzrlib/tests/test_http_response.py

bzrlib/tests/test_https_ca_bundle.py

bzrlib/tests/test_identitymap.py

bzrlib/tests/test_ignores.py

bzrlib/tests/test_index.py

bzrlib/tests/test_info.py

bzrlib/tests/test_inv.py

bzrlib/tests/test_knit.py

bzrlib/tests/test_lazy_import.py

bzrlib/tests/test_lazy_regex.py

bzrlib/tests/test_lockable_files.py

bzrlib/tests/test_lockdir.py

bzrlib/tests/test_log.py

bzrlib/tests/test_lru_cache.py

bzrlib/tests/test_lsprof.py

bzrlib/tests/test_mail_client.py

bzrlib/tests/test_memorytree.py

bzrlib/tests/test_merge.py

bzrlib/tests/test_merge3.py

bzrlib/tests/test_merge_core.py

bzrlib/tests/test_merge_directive.py

bzrlib/tests/test_missing.py

bzrlib/tests/test_msgeditor.py

bzrlib/tests/test_multiparent.py

bzrlib/tests/test_mutabletree.py

bzrlib/tests/test_nonascii.py

bzrlib/tests/test_options.py

bzrlib/tests/test_osutils.py

bzrlib/tests/test_osutils_encodings.py

bzrlib/tests/test_pack.py

bzrlib/tests/test_pack_repository.py

bzrlib/tests/test_patch.py

bzrlib/tests/test_patches.py

bzrlib/tests/test_patches_data/diff-6

bzrlib/tests/test_patches_data/mod

bzrlib/tests/test_patches_data/mod-2

bzrlib/tests/test_patches_data/mod-3

bzrlib/tests/test_patches_data/mod-4

bzrlib/tests/test_patches_data/mod-5

bzrlib/tests/test_patches_data/orig

bzrlib/tests/test_patches_data/orig-2

bzrlib/tests/test_patches_data/orig-3

bzrlib/tests/test_patches_data/orig-4

bzrlib/tests/test_patches_data/orig-5

bzrlib/tests/test_patches_data/orig-6

bzrlib/tests/test_permissions.py

bzrlib/tests/test_plugins.py

bzrlib/tests/test_progress.py

bzrlib/tests/test_read_bundle.py

bzrlib/tests/test_reconcile.py

bzrlib/tests/test_reconfigure.py

bzrlib/tests/test_registry.py

bzrlib/tests/test_remote.py

bzrlib/tests/test_repository.py

bzrlib/tests/test_revert.py

bzrlib/tests/test_revision.py

bzrlib/tests/test_revisionspec.py

bzrlib/tests/test_revisiontree.py

bzrlib/tests/test_rio.py

bzrlib/tests/test_rules.py

bzrlib/tests/test_sampler.py

bzrlib/tests/test_selftest.py

bzrlib/tests/test_setup.py

bzrlib/tests/test_sftp_transport.py

bzrlib/tests/test_shelf.py

bzrlib/tests/test_shelf_ui.py

bzrlib/tests/test_smart.py

bzrlib/tests/test_smart_add.py

bzrlib/tests/test_smart_request.py

bzrlib/tests/test_smart_transport.py

bzrlib/tests/test_smtp_connection.py

bzrlib/tests/test_source.py

bzrlib/tests/test_ssh_transport.py

bzrlib/tests/test_status.py

bzrlib/tests/test_store.py

bzrlib/tests/test_strace.py

bzrlib/tests/test_subsume.py

bzrlib/tests/test_switch.py

bzrlib/tests/test_symbol_versioning.py

bzrlib/tests/test_tag.py

bzrlib/tests/test_testament.py

bzrlib/tests/test_textfile.py

bzrlib/tests/test_textmerge.py

bzrlib/tests/test_timestamp.py

bzrlib/tests/test_trace.py

bzrlib/tests/test_transactions.py

bzrlib/tests/test_transform.py

bzrlib/tests/test_transport.py

bzrlib/tests/test_transport_implementations.py

bzrlib/tests/test_transport_log.py

bzrlib/tests/test_tree.py

bzrlib/tests/test_treebuilder.py

bzrlib/tests/test_tsort.py

bzrlib/tests/test_tuned_gzip.py

bzrlib/tests/test_ui.py

bzrlib/tests/test_uncommit.py

bzrlib/tests/test_upgrade.py

bzrlib/tests/test_upgrade_stacked.py

bzrlib/tests/test_urlutils.py

bzrlib/tests/test_version.py

bzrlib/tests/test_version_info.py

bzrlib/tests/test_versionedfile.py

bzrlib/tests/test_weave.py

bzrlib/tests/test_whitebox.py

bzrlib/tests/test_win32utils.py

bzrlib/tests/test_workingtree.py

bzrlib/tests/test_workingtree_4.py

bzrlib/tests/test_wsgi.py

bzrlib/tests/test_xml.py

bzrlib/tests/transport_util.py

bzrlib/tests/tree_implementations/__init__.py

bzrlib/tests/tree_implementations/test_annotate_iter.py

bzrlib/tests/tree_implementations/test_get_file_mtime.py

bzrlib/tests/tree_implementations/test_get_root_id.py

bzrlib/tests/tree_implementations/test_get_symlink_target.py

bzrlib/tests/tree_implementations/test_inv.py

bzrlib/tests/tree_implementations/test_iter_search_rules.py

bzrlib/tests/tree_implementations/test_list_files.py

bzrlib/tests/tree_implementations/test_path_content_summary.py

bzrlib/tests/tree_implementations/test_revision_tree.py

bzrlib/tests/tree_implementations/test_test_trees.py

bzrlib/tests/tree_implementations/test_tree.py

bzrlib/tests/tree_implementations/test_walkdirs.py

bzrlib/tests/treeshape.py

bzrlib/tests/workingtree_implementations/__init__.py

bzrlib/tests/workingtree_implementations/test_add.py

bzrlib/tests/workingtree_implementations/test_add_reference.py

bzrlib/tests/workingtree_implementations/test_basis_inventory.py

bzrlib/tests/workingtree_implementations/test_basis_tree.py

bzrlib/tests/workingtree_implementations/test_break_lock.py

bzrlib/tests/workingtree_implementations/test_changes_from.py

bzrlib/tests/workingtree_implementations/test_commit.py

bzrlib/tests/workingtree_implementations/test_executable.py

bzrlib/tests/workingtree_implementations/test_flush.py

bzrlib/tests/workingtree_implementations/test_get_file_mtime.py

bzrlib/tests/workingtree_implementations/test_get_parent_ids.py

bzrlib/tests/workingtree_implementations/test_inv.py

bzrlib/tests/workingtree_implementations/test_is_control_filename.py

bzrlib/tests/workingtree_implementations/test_is_ignored.py

bzrlib/tests/workingtree_implementations/test_locking.py

bzrlib/tests/workingtree_implementations/test_merge_from_branch.py

bzrlib/tests/workingtree_implementations/test_mkdir.py

bzrlib/tests/workingtree_implementations/test_move.py

bzrlib/tests/workingtree_implementations/test_nested_specifics.py

bzrlib/tests/workingtree_implementations/test_parents.py

bzrlib/tests/workingtree_implementations/test_paths2ids.py

bzrlib/tests/workingtree_implementations/test_pull.py

bzrlib/tests/workingtree_implementations/test_put_file.py

bzrlib/tests/workingtree_implementations/test_read_working_inventory.py

bzrlib/tests/workingtree_implementations/test_readonly.py

bzrlib/tests/workingtree_implementations/test_remove.py

bzrlib/tests/workingtree_implementations/test_rename_one.py

bzrlib/tests/workingtree_implementations/test_revision_tree.py

bzrlib/tests/workingtree_implementations/test_set_root_id.py

bzrlib/tests/workingtree_implementations/test_smart_add.py

bzrlib/tests/workingtree_implementations/test_uncommit.py

bzrlib/tests/workingtree_implementations/test_unversion.py

bzrlib/tests/workingtree_implementations/test_views.py

bzrlib/tests/workingtree_implementations/test_walkdirs.py

bzrlib/tests/workingtree_implementations/test_workingtree.py

bzrlib/textfile.py

bzrlib/textinv.py

bzrlib/textmerge.py

bzrlib/textui.py

bzrlib/timestamp.py

bzrlib/trace.py

bzrlib/transactions.py

bzrlib/transform.py

bzrlib/transport/__init__.py

bzrlib/transport/brokenrename.py

bzrlib/transport/chroot.py

bzrlib/transport/decorator.py

bzrlib/transport/fakenfs.py

bzrlib/transport/fakevfat.py

bzrlib/transport/ftp/__init__.py

bzrlib/transport/ftp/_gssapi.py

bzrlib/transport/http/__init__.py

bzrlib/transport/http/_pycurl.py

bzrlib/transport/http/_urllib.py

bzrlib/transport/http/_urllib2_wrappers.py

bzrlib/transport/http/ca_bundle.py

bzrlib/transport/http/response.py

bzrlib/transport/http/wsgi.py

bzrlib/transport/local.py

bzrlib/transport/log.py

bzrlib/transport/memory.py

bzrlib/transport/nosmart.py

bzrlib/transport/readonly.py

bzrlib/transport/remote.py

bzrlib/transport/sftp.py

bzrlib/transport/ssh.py

bzrlib/transport/trace.py

bzrlib/transport/unlistable.py

bzrlib/tree.py

bzrlib/treebuilder.py

bzrlib/tsort.py

bzrlib/tuned_gzip.py

bzrlib/ui/__init__.py

bzrlib/ui/text.py

bzrlib/uncommit.py

bzrlib/upgrade.py

bzrlib/urlutils.py

bzrlib/util/bencode.py

bzrlib/util/configobj/configobj.py

bzrlib/version.py

bzrlib/version_info_formats/__init__.py

bzrlib/version_info_formats/format_custom.py

bzrlib/version_info_formats/format_python.py

bzrlib/version_info_formats/format_rio.py

bzrlib/versionedfile.py

bzrlib/views.py

bzrlib/weave.py

bzrlib/weave_commands.py

bzrlib/weavefile.py

bzrlib/win32utils.py

bzrlib/workingtree.py

bzrlib/workingtree_4.py

bzrlib/xml4.py

bzrlib/xml5.py

bzrlib/xml6.py

bzrlib/xml7.py

bzrlib/xml8.py

bzrlib/xml_serializer.py

contrib/bzr_access

contrib/newinventory.py

contrib/pwclient.full

debian/TODO

debian/changelog

debian/control

doc/developers/container-format.txt

doc/developers/development-repo.txt

doc/developers/incremental-push-pull.txt

doc/developers/index.txt

doc/developers/inventory.txt

doc/developers/lca-merge.txt

doc/developers/merge-scaling.txt

doc/developers/network-protocol.txt

doc/developers/overview.txt

doc/developers/performance-roadmap.txt

doc/developers/planned-change-integration.txt

doc/developers/planned-performance-changes.txt

doc/developers/releasing.txt

doc/developers/revision-properties.txt

doc/en/quick-reference/Makefile

doc/en/quick-reference/quick-start-summary.png

doc/en/tutorials/using_bazaar_with_launchpad.txt

doc/en/user-guide/browsing_history.txt

doc/en/user-guide/bzrtools_plugin.txt

doc/en/user-guide/filtered_views.txt

doc/en/user-guide/images/workflows_centralized.png

doc/en/user-guide/images/workflows_gatekeeper.png

doc/en/user-guide/images/workflows_localcommit.png

doc/en/user-guide/images/workflows_peer.png

doc/en/user-guide/images/workflows_pqm.png

doc/en/user-guide/images/workflows_shared.png

doc/en/user-guide/images/workflows_single.png

doc/en/user-guide/index.txt

doc/en/user-guide/installing_bazaar.txt

doc/en/user-guide/introducing_bazaar.txt

doc/en/user-guide/releasing_a_project.txt

doc/en/user-guide/server.txt

doc/en/user-guide/specifying_revisions.txt

doc/en/user-guide/svn_plugin.txt

doc/en/user-guide/web_browsing.txt

doc/index.txt

generate_docs.py

profile_imports.py

setup.py

tools/bzr_epydoc_uid.py

tools/convertfile.py

tools/convertinv.py

tools/doc_generate/__init__.py

tools/doc_generate/autodoc_bash_completion.py

tools/doc_generate/autodoc_man.py

tools/doc_generate/autodoc_rstx.py

tools/history2revfiles.py

tools/package_mf.py

tools/prepare_for_latex.py

tools/weavebench.py

tools/win32/build_release.py

tools/win32/bzr_postinstall.py

Show diffs side-by-side

added added

removed removed

doc/developers/improved_chk_index.txt

===================

CHK Optimized index

===================

Our current btree style index is nice as a general index, but it is not optimal

for Content-Hash-Key based content. With CHK, the keys themselves are hashes,

which means they are randomly distributed (similar keys do not refer to

similar content), and they do not compress well. However, we can create an

index which takes advantage of these abilites, rather than suffering from

them. Even further, there are specific advantages provided by

``groupcompress``, because of how individual items are clustered together.

Btree indexes also rely on zlib compression, in order to get their compact

size, and further has to try hard to fit things into a compressed 4k page.

When the key is a sha1 hash, we would not expect to get better than 20bytes

per key, which is the same size as the binary representation of the hash. This

means we could write an index format that gets approximately the same on-disk

size, without having the overhead of ``zlib.decompress``. Some thought would

still need to be put into how to efficiently access these records from remote.

Required information

====================

For a given groupcompress record, we need to know the offset and length of the

compressed group in the .pack file, and the start and end of the content inside

the uncompressed group. The absolute minimum is slightly less, but this is a

good starting point. The other thing to consider, is that for 1M revisions and

1M files, we'll probably have 10-20M CHK pages, so we want to make sure we

have an index that can scale up efficiently.

1. A compressed sha hash is 20-bytes

2. Pack files can be > 4GB, we could use an 8-byte (64-bit) pointer, or we

could store a 5-byte pointer for a cap at 1TB. 8-bytes still seems like

overkill, even if it is the natural next size up.

3. An individual group would never be longer than 2^32, but they will often

be bigger than 2^16. 3 bytes for length (16MB) would be the minimum safe

length, and may not be safe if we expand groups for large content (like ISOs).

So probably 4-bytes for group length is necessary.

4. A given start offset has to fit in the group, so another 4-bytes.

5. Uncompressed length of record is based on original size, so 4-bytes is

expected as well.

6. That leaves us with 20+8+4+4+4 = 40 bytes per record. At the moment, btree

compression gives us closer to 38.5 bytes per record. We don't have perfect

compression, but we also don't have >4GB pack files (and if we did, the first

4GB are all under then 2^32 barrier :).

If we wanted to go back to the ''minimal'' amount of data that we would need to

store.

1. 8 bytes of a sha hash are generally going to be more than enough to fully

determine the entry (see `Partial hash`_). We could support some amount of

collision in an index record, in exchange for resolving it inside the

content. At least in theory, we don't *have* to record the whole 20-bytes

for the sha1 hash. (8-bytes gives us less than 1 in 1000 chance of

a single collision for 10M nodes in an index)

2. We could record the start and length of each group in a separate location,

and then have each record reference the group by an 'offset'. This is because

we expect to have many records in the same group (something like 10k or so,

though we've fit >64k under some circumstances). At a minimum, we have one

record per group so we have to store at least one reference anyway. So the

maximum overhead is just the size and cost of the dereference (and normally

will be much much better than that.)

3. If a group reference is an 8-byte start, and a 4-byte length, and we have

10M keys, but get at least 1k records per group, then we would have 10k

groups. So we would need 120kB to record all the group offsets, and then

each individual record would only need a 2-byte group number, rather than a

12-byte reference. We could be safe with a 4-byte group number, but if

each group is ~1MB, 64k groups is 64GB. We can start with 2-byte, but leave

room in the header info to indicate if we have more than 64k group entries.

Also, current grouping creates groups of 4MB each, which would make it

256GB, to create 64k groups. And our current chk pages compress down to

less than 100 bytes each (average is closer to 40 bytes), which for 256GB

of raw data, would amount to 2.7 billion CHK records. (This will change if

we start to use CHK for text records, as they do not compress down as

small.) Using 100 bytes per 10M chk records, we have 1GB of compressed chk

data, split into 4MB groups or 250 total groups. Still << 64k groups.

Conversions could create 1 chk record at a time, creating a group for each,

but they would be foolish to not commit a write group after 10k revisions

(assuming 6 CHK pages each).

4. We want to know the start-and-length of a record in the decompressed

stream. This could actually be moved into a mini-index inside the group

itself. Initial testing showed that storing an expanded "key =>

start,offset" consumed a considerable amount of compressed space. (about

30% of final size was just these internal indices.) However, we could move

to a pure "record 1 is at location 10-20", and then our external index

would just have a single 'group entry number'.

There are other internal forces that would give a natural cap of 64k

entries per group. So without much loss of generality, we could probably get

away with a 2-byte 'group entry' number. (which then generates an 8-byte

offset + endpoint as a header in the group itself.)

100

101

5. So for 1M keys, an ideal chk+group index would be:

102

103

a. 6-byte hash prefix

104

105

b. 2-byte group number

106

107

c. 2-byte entry in group number

108

109

d. a separate lookup of 12-byte group number to offset + length

110

111

e. a variable width mini-index that splits X bits of the key. (to maintain

112

small keys, low chance of collision, this is *not* redundant with the

113

value stored in (a)) This should then dereference into a location in

114

the index. This should probably be a 4-byte reference. It is unlikely,

115

but possible, to have an index >16MB. With an 10-byte entry, it only

116

takes 1.6M chk nodes to do so. At the smallest end, this will probably

117

be a 256-way (8-bits) fan out, at the high end it could go up to

118

64k-way (16-bits) or maybe even 1M-way (20-bits). (64k-way should

119

handle up to 5-16M nodes and still allow a cheap <4k read to find the

120

final entry.)

121

122

So the max size for the optimal groupcompress+chk index with 10M entries would be::

123

124

10 * 10M (entries) + 64k * 12 (group) + 64k * 4 (mini index) = 101 MiB

125

126

So 101MiB which breaks down as 100MiB for the actual entries, 0.75MiB for the

127

group records, and 0.25MiB for the mini index.

128

129

1. Looking up a key would involve:

130

131

a. Read ``XX`` bytes to get the header, and various config for the index.

132

Such as length of the group records, length of mini index, etc.

133

134

b. Find the offset in the mini index for the first YY bits of the key. Read

135

the 4 byte pointer stored at that location (which may already be in the

136

first content if we pre-read a minimum size.)

137

138

c. Jump to the location indicated, and read enough bytes to find the

139

correct 12-byte record. The mini-index only indicates the start of

140

records that start with the given prefix. A 64k-way index resolves 10MB

141

records down to 160 possibilities. So at 12 bytes each, to read all would

142

cost 1920 bytes to be read.

143

144

d. Determine the offset for the group entry, which is the known ``start of

145

groups`` location + 12B*offset number. Read its 12-byte record.

146

147

e. Switch to the .pack file, and read the group header to determine where in

148

the stream the given record exists. At this point, you have enough

149

information to read the entire group block. For local ops, you could

150

only read enough to get the header, and then only read enough to

151

decompress just the content you want to get at.

152

153

Using an offset, you also don't need to decode the entire group header.

154

If we assume that things are stored in fixed-size records, you can jump

155

to exactly the entry that you care about, and read its 8-byte

156

(start,length in uncompressed) info. If we wanted more redundancy we

157

could store the 20-byte hash, but the content can verify itself.

158

159

f. If the size of these mini headers becomes critical (8 bytes per record

160

is 8% overhead for 100 byte records), we could also compress this mini

161

header. Changing the number of bytes per entry is unlikely to be

162

efficient, because groups standardize on 4MiB wide, which is >>64KiB for

163

a 2-byte offset, 3-bytes would be enough as long as we never store an

164

ISO as a single entry in the content. Variable width also isn't a big

165

win, since base-128 hits 4-bytes at just 2MiB.

166

167

For minimum size without compression, we could only store the 4-byte

168

length of each node. Then to compute the offset, you have to sum all

169

previous nodes. We require <64k nodes in a group, so it is up to 256KiB

170

for this header, but we would lose partial reads. This should still be

171

cheap in compiled code (needs tests, as you can't do partial info), and

172

would also have the advantage that fixed width would be highly

173

compressible itself. (Most nodes are going to have a length that fits

174

1-2 bytes.)

175

176

An alternative form would be to use the base-128 encoding. (If the MSB

177

is set, then the next byte needs to be added to the current value

178

shifted by 7*n bits.) This encodes 4GiB in 5 bytes, but stores 127B in 1

179

byte, and 2MiB in 3 bytes. If we only stored 64k entries in a 4 MiB

180

group, the average size can only be 64B, which fits in a single byte

181

length, so 64KiB for this header, or only 1.5% overhead. We also don't

182

have to compute the offset of *all* nodes, just the ones before the one

183

we want, which is the similar to what we have to do to get the actual

184

content out.

185

186

187

Partial Hash

188

============

189

The size of the index is dominated by the individual entries (the 1M records).

190

Saving 1 byte there saves 1MB overall, which is the same as the group entries

191

and mini index combined. If we can change the index so that it can handle

192

collisions gracefully (have multiple records for a given collision), then we

193

can shrink the number of bytes we need overall. Also, if we aren't going to

194

put the full 20-bytes into the index, then some form of graceful handling of

195

collisions is recommended anyway.

196

197

The current structure does this just fine, in that the mini-index dereferences

198

you to a "list" of records that start with that prefix. It is assumed that

199

those would be sorted, but we could easily have multiple records. To resolve

200

the exact record, you can read both records, and compute the sha1 to decide

201

between them. This has performance implications, as you are now decoding 2x

202

the records to get at one.

203

204

The chance of ``n`` texts colliding with a hash space of ``H`` is generally

205

given as::

206

207

1 - e ^(-n^2 / 2 H)

208

209

Or if you use ``H = 2^h``, where ``h`` is the number of bits::

210

211

1 - e ^(-n^2 / 2^(h+1))

212

213

For 1M keys and 4-bytes (32-bit), the chance of collision is for all intents

214

and purposes 100%. Rewriting the equation to give the number of bits (``h``)

215

needed versus the number of entries (``n``) and the desired collision rate

216

(``epsilon``)::

217

218

h = log_2(-n^2 / ln(1-epsilon)) - 1

219

220

The denominator ``ln(1-epsilon)`` == ``-epsilon``` for small values (even @0.1

221

== -0.105, and we are assuming we want a much lower chance of collision than

222

10%). So we have::

223

224

h = log_2(n^2/epsilon) - 1 = 2 log_2(n) - log_2(epsilon) - 1

225

226

Given that ``epsilon`` will often be very small and ``n`` very large, it can

227

be more convenient to transform it into ``epsilon = 10^-E`` and ``n = 10^N``,

228

which gives us::

229

230

h = 2 * log_2(10^N) - 2 log_2(10^-E) - 1

231

h = log_2(10) (2N + E) - 1

232

h ~ 3.3 (2N + E) - 1

233

234

Or if we use number of bytes ``h = 8H``::

235

236

H ~ 0.4 (2N + E)

237

238

This actually has some nice understanding to be had. For every order of

239

magnitude we want to increase the number of keys (at the same chance of

240

collision), we need ~1 byte (0.8), for every two orders of magnitude we want

241

to reduce the chance of collision we need the same extra bytes. So with 8

242

bytes, you can have 20 orders of magnitude to work with, 10^10 keys, with

243

guaranteed collision, or 10 keys with 10^-20 chance of collision.

244

245

Putting this in a different form, we could make ``epsilon == 1/n``. This gives

246

us an interesting simplified form::

247

248

h = log_2(n^3) - 1 = 3 log_2(n) - 1

249

250

writing ``n`` as ``10^N``, and ``H=8h``::

251

252

h = 3 N log_2(10) - 1 =~ 10 N - 1

253

H ~ 1.25 N

254

255

So to have a one in a million chance of collision using 1 million keys, you

256

need ~59 bits, or slightly more than 7 bytes. For 10 million keys and a one in

257

10 million chance of any of them colliding, you can use 9 (8.6) bytes. With 10

258

bytes, we have a one in a 100M chance of getting a collision in 100M keys

259

(substituting back, the original equation says the chance of collision is 4e-9

260

for 100M keys when using 10 bytes.)

261

262

Given that the only cost for a collision is reading a second page and ensuring

263

the sha hash actually matches we could actually use a fairly "high" collision

264

rate. A chance of 1 in 1000 that you will collide in an index with 1M keys is

265

certainly acceptible. (note that isn't 1 in 1000 of those keys will be a

266

collision, but 1 in 1000 that you will have a *single* collision). Using a

267

collision chance of 10^-3, and number of keys 10^6, means we need (12+3)*0.4 =

268

6 bytes. For 10M keys, you need (14+3)*0.4 = 6.8 aka 7. We get that extra byte

269

from the ``mini-index``. In an index with a lot of keys, you want a bigger

270

fan-out up front anyway, which gives you more bytes consumed and extends your

271

effective key width.

272

273

Also taking one more look at ``H ~ 0.4 (2N + E)``, you can rearrange and

274

consider that for every order of magnitude more keys you insert, your chance

275

for collision goes up by 2 orders of magnitude. But for 100M keys, 8 bytes

276

gives you a 1 in 10,000 chance of collision, and that is gotten at a 16-bit

277

fan-out (64k-way), but for 100M keys, we would likely want at least 20-bit fan

278

out.

279

280

You can also see this from the original equation with a bit of rearranging::

281

282

epsilon = 1 - e^(-n^2 / 2^(h+1))

283

epsilon = 1 - e^(-(2^N)^2 / (2^(h+1))) = 1 - e^(-(2^(2N))(2^-(h+1)))

284

= 1 - e^(-(2^(2N - h - 1)))

285

286

Such that you want ``2N - h`` to be a very negative integer, such that

287

``2^-X`` is thus very close to zero, and ``1-e^0 = 0``. But you can see that

288

if you want to double the number of source texts, you need to quadruple the

289

number of bits.

290

291

292

Scaling Sizes

293

=============

294

295

Scaling up

296

----------

297

298

We have said we want to be able to scale to a tree with 1M files and 1M

299

commits. With a 255-way fan out for chk pages, you need 2 internal nodes,

300

and a leaf node with 16 items. (You maintain 2 internal nodes up until 16.5M

301

nodes, when you get another internal node, and your leaf nodes shrink down to

302

1 again.) If we assume every commit averages 10 changes (large, but possible,

303

especially with large merges), then you get 1 root + 10*(1 internal + 1 leaf

304

node) per commit or 21 nodes per commit. At 1M revisions, that is 21M chk

305

nodes. So to support the 1Mx1M project, we really need to consider having up

306

to 100M chk nodes.

307

308

Even if you went up to 16M tree nodes, that only bumps us up to 31M chk

309

nodes. Though it also scales by number of changes, so if you had a huge churn,

310

and had 100 changes per commit and a 16M node tree, you would have 301M chk

311

nodes. Note that 8 bytes (64-bits) in the prefix still only gives us a 0.27%

312

chance of collision (1 in 370). Or if you had 370 projects of that size, with

313

all different content, *one* of them would have a collision in the index.

314

315

We also should consider that you have the ``(parent_id,basename) => file_id``

316

map that takes up its own set of chk pages, but testing seems to indicate that

317

it is only about 1/10th that of the ``id_to_entry`` map. (rename,add,delete

318

are much less common then content changes.)

319

320

As a point of reference, one of the largest projects today OOo, has only 170k

321

revisions, and something less than 100k files (and probably 4-5 changes per

322

commit, but their history has very few merges, being a conversion from CVS).

323

At 100k files, they are probably just starting to hit 2-internal nodes, so

324

they would end up with 10 pages per commit (as a fair-but-high estimate), and

325

at 170k revs, that would be 1.7M chk nodes.

326

327

328

Scaling down

329

------------

330

331

While it is nice to scale to a 16M files tree with 1M files (100M total

332

changes), it is also important to scale efficiently to more *real world*

333

scenarios. Most projects will fall into the 255-64k file range, which is where

334

you have one internal node and 255 leaf nodes (1-2 chk nodes per commit). And

335

a modest number of changes (10 is generally a high figure). At 50k revisions,

336

that would give you 50*2*10=500k chk nodes. (Note that all of python has 303k

337

chk nodes, all of launchpad has 350k, mysql-5.1 in gc255 rather than gc255big had

338

650k chk nodes, [depth=3].)

339

340

So for these trees, scaling to 1M nodes is more than sufficient, and allows us

341

to use a 6-byte prefix per record. At a minimum, group records could use a

342

4-byte start and 3-byte length, but honestly, they are a tiny fraction of the

343

overall index size, and it isn't really worth the implementation cost of being

344

flexible here. We can keep a field in the header for the group record layout

345

(8, 4) and for now just assert that this size is fixed.

346

347

348

Other discussion

349

================

350

351

group encoding

352

--------------

353

354

In the above scheme we store the group locations as an 8-byte start, and

355

4-byte length. We could theoretically just store a 4-byte length, and then you

356

have to read all of the groups and add them up to determine the actual start

357

position. The trade off is a direct jump-to-location versus storing 3x the

358

data. Given when you have 64k groups you will need only .75MiB to store it,

359

versus the 120MB for the actual entries, this seems to be no real overhead.

360

Especially when you consider that 10M chk nodes should fit in only 250 groups,

361

so total data is actually only 3KiB. Then again, if it was only 1KiB it is

362

obvious that you would read the whole thing in one pass. But again, see the

363

pathological "conversion creating 1 group per chk page" issue.

364

365

Also, we might want to support more than 64k groups in a given index when we

366

get to the point of storing file content in a CHK index. A lot of the analysis

367

about the number of groups is based on the 100 byte compression of CHK nodes,

368

which would not be true with file-content. We should compress well, I don't

369

expect us to compress *that* well. Launchpad shows that the average size of a

370

content record is about 500-600 bytes (after you filter out the ~140k that are

371

NULL content records). At that size, you expect to get approx 7k records per

372

group, down from 40k. Going further, though, you also want to split groups

373

earlier, since you end up with better compression. so with 100,000 unique file

374

texts, you end up with ~100 groups. With 1M revisions @ 10 changes each, you

375

have 10M file texts, and would end up at 10,485 groups. That seems like more

376

64k groups is still more than enough head room. You need to fit only 100

377

entries per group, to get down to where you are getting into trouble (and have

378

10M file texts.) Something to keep an eye on, but unlikely to be something

379

that is strictly a problem.

380

381

Still reasonable to have a record in the header indicating that index entries

382

use a 2-byte group entry pointer, and allow it to scale to 3 (we may also find

383

a win scaling it down to 1 in the common cases of <250 groups). Note that if

384

you have the full 4MB groups, it takes 256 GB of compressed content to fill

385

64k records. And our groups are currently scaled that we require at least

386

1-2MB before they can be considered 'full'.

387

388

389

variable length index entries

390

-----------------------------

391

392

The above had us store 8-bytes of sha hash, 2 bytes of group number, and

393

2 bytes for record-in-group. However, since we have the variable-pointer

394

mini-index, we could consider having those values be 'variable length'. So

395

when you read the bytes between the previous-and-next record, you have a

396

parser that can handle variable width. The main problem is that to encode

397

start/stop of record takes some bytes, and at 12-bytes for a record, you don't

398

have a lot of space to waste for a "end-of-entry" indicator. The easiest would

399

be to store things in base-128 (high bit indicates the next byte also should

400

be included).

401

402

403

storing uncompressed offset + length

404

------------------------------------

405

406

To get the smallest index possible, we store only a 2-byte 'record indicator'

407

inside the index, and then assume that it can be decoded once we've read the

408

actual group. This is certainly possible, but it represents yet another layer

409

of indirection before you can actually get content. If we went with

410

variable-length index entries, we could probably get most of the benefit with

411

a variable-width start-of-entry value. The length-of-content is already being

412

stored as a base128 integer starting at the second byte of the uncompressed

413

data (the first being the record type, fulltext/delta). It complicates some of

414

our other processing, since we would then only know how much to decompress to

415

get the start of the record.

416

417

Another intriguing possibility would be to store the *end* of the record in

418

the index, and then in the data stream store the length and type information

419

at the *end* of the record, rather than at the beginning (or possibly at both

420

ends). Storing it at the end is a bit unintuitive when you think about reading

421

in the data as a stream, and figuring out information (you have to read to the

422

end, then seek back) But a given GC block does store the

423

length-of-uncompressed-content, which means we can trivially decompress, jump

424

to the end, and then walk-backwards for everything else.

425

426

Given that every byte in an index entry costs 10MiB in a 10M index, it is

427

worth considering. At 4MiB for a block, base 128 takes 4 bytes to encode the

428

last 50% of records (those beyond 2MiB), 3 bytes for everything from 16KiB =>

429

2MiB. So the expected size is for all intents and purposes, 3.5 bytes. (Just

430

due to an unfortunate effect of where the boundary is that you need more

431

bytes.) If we capped the data at 2MB, the expected drops to just under 3

432

bytes. Note that a flat 3bytes could decode up to 16MiB, which would be much

433

better for our purpose, but wouldn't let us write groups that had a record

434

after 16MiB, which doesn't work for the ISO case. Though it works *absolutely*

435

fine for the CHK inventory cases (what we have today).

436

437

438

null content

439

------------

440

At the moment, we have a lot of records in our per-file graph that refers to

441

empty content. We get one for every symlink and directory, for every time that

442

they change. This isn't specifically relevant for CHK pages, but for

443

efficiency we could certainly consider setting "group = 0 entry = 0" to mean

444

that this is actually a no-content entry. It means the group block itself

445

doesn't have to hold a record for it, etc. Alternatively we could use

446

"group=FFFF entry = FFFF" to mean the same thing.

447

448

449

``VF.keys()``

450

-------------

451

At the moment, some apis expect that you can list the references by reading

452

all of the index. We would like to get away from this anyway, as it doesn't

453

scale particularly well. However, with this format, we no longer store the

454

exact value for the content. The content is self describing, and we *would* be

455

storing enough to uniquely decide which node to read. Though that is actually

456

contained in just 4-bytes (2-byte group, 2-byte group entry).

457

458

We use ``VF.keys()`` during 'pack' and 'autopack' to avoid asking for content

459

we don't have, and to put a counter on the progress bar. For the latter, we

460

can just use ``index.key_count()`` for the former, we could just properly

461

handle ``AbsentContentFactory``.

462

463

464

More than 64k groups

465

--------------------

466

Doing a streaming conversion all at once is still something to consider. As it

467

would default to creating all chk pages in separate groups (300-400k easily).

468

However, just making the number of group block entries variable, and allowing

469

the pointer in each entry to be variable should suffice. At 3 bytes for the

470

group pointer, we can refer to 16.7M groups. It does add complexity, but it is

471

likely necessary to allow for arbitrary cases.

472

473

474

vim: ft=rst tw=78 ai

Older »