~abentley/bzr/devnotes : contents of nested-trees.txt at revision 24

************
Nested Trees
************

Principles
**********

- Never store a location in versioned data.

- Implementation of nested trees shall not make operations observably slower
  for those not using nested trees.  (Using nested trees will impact performance
  in the initial implementation, but the design will allow a performant
  implementation.)

- A repository that holds a revision R should be able to reconstruct the
  whole contents of that revision, including any nested trees.  Corollary:
  if I fetch that revision, even into a branch that has no working tree, it
  should bring across any referenced revisions.

- A project with zero nested trees must pay zero performance cost when
  using a format supporting them.

Design
******

(See also <http://bazaar-vcs.org/NestedTreesDesign>)

Inventory entries of type 'tree reference' have a file-id and a revision-id.

In the working tree, each subtree is represented as a subdirectory.  The top
containing tree's branch will be in a format with these properties:

 - it holds the last-revision pointers for all the nested branches, indexed by
   the file-id of their tree reference.
 - the subbranches use the same repository as the top tree.  This repository
   may be at the top tree, or a shared repository used by the top tree.

On disk, the subtrees will simply indicate that their data is stored in a top
tree.  The top tree for a subtree is located by walking parent dirs until a top
tree is found.
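
The parent-directory walk can be sketched as follows.  This is a minimal
illustration, not bzr's implementation: the marker path ``.bzr/top-tree`` and
the function name ``find_top_tree`` are hypothetical, since these notes do not
say how a top tree is recognised on disk.

.. code-block:: python

   from pathlib import Path
   from typing import Optional

   # Hypothetical marker: these notes do not define how a top tree is detected.
   TOP_TREE_MARKER = Path(".bzr") / "top-tree"


   def find_top_tree(subtree_dir: Path) -> Optional[Path]:
       """Return the nearest ancestor directory holding a top tree, or None."""
       current = subtree_dir.resolve()
       # Start at the parent: the subtree itself only points at a top tree.
       for candidate in current.parents:
           if (candidate / TOP_TREE_MARKER).exists():
               return candidate
       return None

The walk stops at the first match, so a subtree nested several directories deep
still resolves to the closest enclosing top tree.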

Only the top branch will have a branch.conf.  When an operation on a subbranch
would normally use values from branch.conf, it will look them up in the top
branch's branch.conf and adjust for the sub-location if appropriate.  For
example, "bzr push" in a subtree will push just that subbranch to the
corresponding subbranch in the configured push location of the top branch.
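
As a rough sketch of "adjust for the sub-location", a subbranch's push target
could be derived by appending the subtree's path to the top branch's configured
push location.  The helper name ``sub_push_location`` and the plain path-joining
rule are assumptions made for illustration; these notes do not define the exact
adjustment.

.. code-block:: python

   import posixpath


   def sub_push_location(top_push_location: str, subtree_path: str) -> str:
       """Derive a subbranch push URL from the top branch's push location.

       Assumption: the subbranch sits at the same relative path below the
       remote top branch as it does below the local top tree.
       """
       if not subtree_path:
           # An empty path means the top branch itself.
           return top_push_location
       return posixpath.join(
           top_push_location.rstrip("/"), subtree_path.strip("/")
       )

For instance, with a top-branch push location of ``sftp://example.com/proj``
(an invented URL), a subtree at ``lib/foo`` would push to
``sftp://example.com/proj/lib/foo``.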

Process
*******

How do we best manage the risk associated with this complex feature?

Is it by getting something out there and planning/expecting a few
iterations to smooth out issues? Is it by taking more time before
coding and landing patches to discuss the design? Or a mix of these?

We need to be careful. On the one hand, we don't want to put user
data at risk and that's possible if the core data model (not caches)
is flawed. On the other, we don't want to get bogged down worrying
about obscure edge cases now at the expense of solving the common
cases well. This feature will be a classic case of 90% of users
only needing 10% of the capability IMNSHO. (igc)


Undecided questions
*******************

Shall commands recurse downwards by default?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Yes.

Pros:

 - It is hard to accidentally produce inconsistent trees
 - Inconsistent trees are hard for remote users to handle
 - Accidentally committing too many things at once is easy to resolve

Cons:

 - It is easy to accidentally commit too many things at once
 - Accidentally committing nuclear launch codes is easier to do
 - More risk of exposing users to bugs
 - Makes incompleteness more glaring
 - A commit message that makes sense for the top may not make sense lower down.

Shall commands recurse upwards by default?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

No.


Shall subtree branches be addressable?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Yes.  The problems to be solved include:

 - the "name" of subbranches can change from revision to revision.  In fact,
   subbranches may have no name in a given revision.
 - a user may want to get revision X of a subbranch named "foo" in revision Y of
   the top branch.

The syntax will be ``BRANCH#SUBBRANCH``, e.g. ``lp:bigproject#subproject``.
``BRANCH`` is a regular URL and ``SUBBRANCH`` is a path within BRANCH's tree
identifying a subbranch.  Simultaneously addressing a specific revision in both
BRANCH and SUBBRANCH is not currently defined.
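
A minimal sketch of splitting such a location, assuming the first ``#`` is the
separator and that branch URLs do not otherwise contain one.  The function name
``split_subbranch`` is hypothetical.

.. code-block:: python

   from typing import Optional, Tuple


   def split_subbranch(location: str) -> Tuple[str, Optional[str]]:
       """Split ``BRANCH#SUBBRANCH`` into (branch URL, subbranch path or None)."""
       branch, separator, subbranch = location.partition("#")
       if not separator:
           # No '#': the location names an ordinary branch.
           return location, None
       return branch, subbranch

With the example above, ``split_subbranch("lp:bigproject#subproject")`` gives
the branch ``lp:bigproject`` and the subbranch path ``subproject``.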


Shall we model nested trees as a composite tree?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Yes.  Users will see recurse-downwards behaviour that allows operations that
cross subtree boundaries, e.g. a merge in the top tree can move a file between
subtrees.

The downside is that we can't have cheap support for subtrees that are copies of
one another, because we wouldn't know which copy to apply sets of changes to.


Shall we use root-ids for tree references?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Yes.  This fits well with our current lack of support of file copies.  If we do
support file copies in future it will be possible to change this in a future
format, and perform deterministic upgrades to that format.


Shall we recurse our low-level operations from the beginning?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

No.  It will be necessary eventually for good performance, but the initial
implementation is conservative to minimise risk to the rest of bzr.

Shall we lock recursively?
~~~~~~~~~~~~~~~~~~~~~~~~~~

Yes.  It matches existing behaviour by failing earlier, and the extra cost does
not seem onerous.  (To be fully efficient this requires an index of the subtrees,
otherwise we need to scan the full inventory/dirstate.)

(Also, this decision can be changed later with no compatibility concerns.)


How do we handle merge when the subtree hasn't diverged?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

"bzr merge --pull" will be changed so that it will merge (not pull) when the
local last revision's revno would change (i.e. is a non-lhs parent in the merge
source).  This is expected to be the most common way to update nested trees.

The existing "bzr merge --pull" behaviour will be renamed to
"bzr merge --pull-renumber".

"bzr merge" (with no "--pull") will do a merge in all trees.  "bzr pull" will do
a pull in all trees.

The rationale is that a very common use-case is that the top tree is a project
the user is actively committing to, and the subtrees are mainly libraries that
are being mirrored.  So a behaviour that forced every update to be a merge would
be undesirable for the mirrored subtrees, but an update that is a pull wouldn't
suit the changing top tree.  And the existing "merge --pull" (that can renumber
revisions) isn't desirable for either the top tree or subtrees in this case.
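
The proposed "merge --pull" rule can be sketched per tree as below.  This is an
illustration under stated assumptions, not bzr code: the revision graph is
modelled as a plain dict from revision-id to parent list (first parent is the
lhs parent), and the names ``lhs_history`` and ``merge_pull_action`` are
hypothetical.

.. code-block:: python

   from typing import Dict, List, Optional

   # revision-id -> parents; the first parent is the left-hand-side parent.
   ParentMap = Dict[str, List[str]]


   def lhs_history(graph: ParentMap, tip: str) -> List[str]:
       """Return the mainline of ``tip`` by following first parents only."""
       history: List[str] = []
       current: Optional[str] = tip
       while current is not None:
           history.append(current)
           parents = graph.get(current, [])
           current = parents[0] if parents else None
       return history


   def merge_pull_action(graph: ParentMap, local_tip: str, source_tip: str) -> str:
       """Choose "pull" or "merge" for one tree under the proposed rule."""
       if local_tip in lhs_history(graph, source_tip):
           # Fast-forward: every existing local revno stays the same.
           return "pull"
       # Either the local tip is a non-lhs ancestor of the source (a pull
       # would renumber it) or the two have diverged, so merge instead.
       return "merge"

Applied to each tree in turn, a mirrored library whose tip lies on the source's
mainline is pulled, while an actively developed top tree is merged.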


What should uncommit do?
~~~~~~~~~~~~~~~~~~~~~~~~

It will recurse, and subtrees will be uncommitted back to the revision recorded
by the revision the top tree is uncommitting to.

This means that operations like::

   $ echo foo > versioned-file-in-top-tree.txt
   $ bzr ci -m "Change file"
   $ bzr uncommit

will not cause a change in subtrees, since the top-level commit did not affect
them.
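
The recursion rule can be sketched as a lookup of what each top-tree revision
recorded for its tree references.  The data model (a dict per top-tree
revision, mapping tree-reference file-id to subtree revision-id) and the name
``subtree_uncommits`` are assumptions for illustration only.

.. code-block:: python

   from typing import Dict, Tuple

   # top-tree revision-id -> {tree-reference file-id: recorded subtree revision-id}
   RecordedRefs = Dict[str, Dict[str, str]]


   def subtree_uncommits(
       recorded: RecordedRefs, old_top_rev: str, new_top_rev: str
   ) -> Dict[str, Tuple[str, str]]:
       """Map file-id -> (old tip, new tip) for subtrees that must move.

       Subtrees that exist only in the old revision are ignored here; how to
       treat them is not covered by this sketch.
       """
       old_refs = recorded[old_top_rev]
       new_refs = recorded[new_top_rev]
       return {
           file_id: (old_refs[file_id], new_refs[file_id])
           for file_id in old_refs
           if file_id in new_refs and old_refs[file_id] != new_refs[file_id]
       }

When the uncommitted top-level revision recorded the same subtree revisions as
its predecessor, the result is empty, which matches the shell example above.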


Shall we use a CompositeTree object as a shim to make existing commands work?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Yes.  The affected commands are diff, status, and export.  These are all
read-only commands so there's no danger of harming user data.  The only risks
are giving inaccurate results and poor performance.


Some subtrees should have commits and some should not.  How?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We won't do this initially.  Flagging some sub-trees as mirror-only or similar
sounds like a nice feature, but we can add this later (and it may only require a
working tree format bump to add).

It would be nice to have more infrastructure in general to handle mirrors, and
this is not specific to nested trees.


How do we handle files that are moved into subtrees (from another tree in the same top tree)?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Add a revision property:
"Files-from-other-trees: (rev-id: [f-id, ...]), ..."

(Once we can do cherry-pick merges, we might use those instead?)
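
One possible way to build the property value in the shape shown above is
sketched here.  The function name is hypothetical, no escaping of ids is
attempted, and these notes do not say which tree's revision the rev-id refers
to, so treat this purely as a formatting illustration.

.. code-block:: python

   from typing import Dict, List


   def format_files_from_other_trees(moved: Dict[str, List[str]]) -> str:
       """Render {rev-id: [file-id, ...]} as "(rev-id: [f-id, ...]), ..."."""
       parts = []
       for rev_id, file_ids in moved.items():
           parts.append("(%s: [%s])" % (rev_id, ", ".join(file_ids)))
       return ", ".join(parts)

A real serialisation would also need a matching parser and a rule for ids that
contain brackets, commas or colons.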



Handling svn:externals
~~~~~~~~~~~~~~~~~~~~~~

svn externals commonly have references to the tip of some other branch.
Because we want a deterministic representation of the inventory when
pulling from svn, we can't put the specific revision-id into the inventory
at that time.  However, we do want to bring down all the relevant data at
the time of fetching, so that we can later branch locally.  This seems to 

Decided questions
*****************

The branches of subtrees shall share a repository with the containing tree.

The branches of subtrees shall be in a special format that shares a single
last_revision file that is stored in the containing branch.

The subtree branches shall be referenced in the last_revision file by file-id.

Subtree branches shall not support individual configuration.

Fetch shall automatically fetch the revisions mentioned by tree-references,
recursively.
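
The recursive part of this decision can be sketched as a graph walk.  The
model is hypothetical: ``tree_refs`` maps a revision-id to the revision-ids
named by the tree references in that revision's inventory.  Ordinary ancestry
and the reserved "head:" id are deliberately left out of the sketch.

.. code-block:: python

   from typing import Dict, List, Set

   # revision-id -> revision-ids named by that revision's tree references
   TreeRefs = Dict[str, List[str]]


   def revisions_to_fetch(tree_refs: TreeRefs, revision_id: str) -> Set[str]:
       """Collect a revision plus everything reachable via tree references."""
       needed: Set[str] = set()
       pending = [revision_id]
       while pending:
           rev = pending.pop()
           if rev in needed:
               # Already visited; this also guards against reference cycles.
               continue
           needed.add(rev)
           pending.extend(tree_refs.get(rev, []))
       return needed

An explicit work list is used instead of recursion so that deeply nested
references cannot exhaust the call stack.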

The reserved revision-id "head:" shall be used in tree-references to refer to
the tip revision of a branch.

bzr-svn repositories with externals shall behave as though the multiple
repositories were a single Bazaar repository with multiple branches.


Use Cases
*********

Case 1
~~~~~~
Barry works on a project with three libraries.  He wants to keep up to date
with the tip of those libraries, but he doesn't want them to be part of his
source tree.

Case 2
~~~~~~
Now, Barry wants to add a fourth library.

Case 3
~~~~~~
Barry wants to publish his project.

Case 4
~~~~~~
Barry decides to make part of his project into another library.

Case 5
~~~~~~
Curtis wants to hack on Barry's project.

Case 6
~~~~~~
Barry wants to drop one of the libraries he was using.

Case 7
~~~~~~
Curtis has made changes to one of the libraries.  Barry wants to merge Curtis'
changes into his copy.

Case 8
~~~~~~
Curtis has made changes to Barry's main project.  Barry wants to merge Curtis'
changes into his copy.

Case 9
~~~~~~
Barry makes changes in his project and in a library, and he runs status.

Case 10
~~~~~~~
Barry wants to upgrade the bazaar format of his project.

Case 11
~~~~~~~
Curtis wants to apply Barry's latest changes.

Case 12
~~~~~~~
Danilo wants to start a project with two libraries using nested trees from
scratch.

Case 13
~~~~~~~
Edwin has a project that doesn't use nested trees and he wants to start using
nested trees.

Case 14
~~~~~~~
Françis has a project with nested trees where the containing tree uses one
Bazaar format and the subtree uses a different Bazaar format.

Case 15
~~~~~~~
Barry commits some changes to a library and to the main project, and then
discovers the changes are not appropriate.

Case 16
~~~~~~~
Gary is writing a project.  Henninge wants to split a library out of it.

Case 17
~~~~~~~
Henninge wants to update to receive Gary's latest changes.

Case 18
~~~~~~~
Gary wants to update to receive Henninge's changes, including splitting a
library out.

Case 19
~~~~~~~
Gary wants to update to receive Henninge's changes, without splitting a library
out.

.. vim: ft=rst