Module b.dirstate

Part of bzrlib

DirState objects record the state of a directory and its bzr metadata.

Pseudo EBNF grammar for the state file. Fields are separated by NULLs, and
lines by NL. The field delimiters are ommitted in the grammar, line delimiters
are not - this is done for clarity of reading. All string data is in utf8.

MINIKIND = "f" | "d" | "l" | "a" | "r" | "t";
NL = "
";
NULL = "";
WHOLE_NUMBER = {digit}, digit;
BOOLEAN = "y" | "n";
REVISION_ID = a non-empty utf8 string;

dirstate format = header line, full checksum, row count, parent details,
 ghost_details, entries;
header line = "#bazaar dirstate flat format 3", NL;
full checksum = "crc32: ", ["-"], WHOLE_NUMBER, NL;
row count = "num_entries: ", WHOLE_NUMBER, NL;
parent_details = WHOLE NUMBER, {REVISION_ID}* NL;
ghost_details = WHOLE NUMBER, {REVISION_ID}*, NL;
entries = {entry};
entry = entry_key, current_entry_details, {parent_entry_details};
entry_key = dirname,  basename, fileid;
current_entry_details = common_entry_details, working_entry_details;
parent_entry_details = common_entry_details, history_entry_details;
common_entry_details = MINIKIND, fingerprint, size, executable
working_entry_details = packed_stat
history_entry_details = REVISION_ID;
executable = BOOLEAN;
size = WHOLE_NUMBER;
fingerprint = a nonempty utf8 sequence with meaning defined by minikind.

Given this definition, the following is useful to know:
entry (aka row) - all the data for a given key.
entry[0]: The key (dirname, basename, fileid)
entry[0][0]: dirname
entry[0][1]: basename
entry[0][2]: fileid
entry[1]: The tree(s) data for this path and id combination.
entry[1][0]: The current tree
entry[1][1]: The second tree

For an entry for a tree, we have (using tree 0 - current tree) to demonstrate:
entry[1][0][0]: minikind
entry[1][0][1]: fingerprint
entry[1][0][2]: size
entry[1][0][3]: executable
entry[1][0][4]: packed_stat
OR (for non tree-0)
entry[1][1][4]: revision_id

There may be multiple rows at the root, one per id present in the root, so the
in memory root row is now:
self._dirblocks[0] -> ('', [entry ...]),
and the entries in there are
entries[0][0]: ''
entries[0][1]: ''
entries[0][2]: file_id
entries[1][0]: The tree data for the current tree for this fileid at /
etc.

Kinds:
'r' is a relocated entry: This path is not present in this tree with this id,
    but the id can be found at another location. The fingerprint is used to
    point to the target location.
'a' is an absent entry: In that tree the id is not present at this path.
'd' is a directory entry: This path in this tree is a directory with the
    current file id. There is no fingerprint for directories.
'f' is a file entry: As for directory, but its a file. The fingerprint is a
    sha1 value.
'l' is a symlink entry: As for directory, but a symlink. The fingerprint is the
    link target.
't' is a reference to a nested subtree; the fingerprint is the referenced
    revision.

Ordering:

The entries on disk and in memory are ordered according to the following keys:

    directory, as a list of components
    filename
    file-id

--- Format 1 had the following different definition: ---
rows = dirname, NULL, basename, NULL, MINIKIND, NULL, fileid_utf8, NULL,
    WHOLE NUMBER (* size *), NULL, packed stat, NULL, sha1|symlink target, 
    {PARENT ROW}
PARENT ROW = NULL, revision_utf8, NULL, MINIKIND, NULL, dirname, NULL,
    basename, NULL, WHOLE NUMBER (* size *), NULL, "y" | "n", NULL,
    SHA1

PARENT ROW's are emitted for every parent that is not in the ghosts details
line. That is, if the parents are foo, bar, baz, and the ghosts are bar, then
each row will have a PARENT ROW for foo and baz, but not for bar.


In any tree, a kind of 'moved' indicates that the fingerprint field
(which we treat as opaque data specific to the 'kind' anyway) has the
details for the id of this row in that tree.

I'm strongly tempted to add a id->path index as well, but I think that
where we need id->path mapping; we also usually read the whole file, so
I'm going to skip that for the moment, as we have the ability to locate
via bisect any path in any tree, and if we lookup things by path, we can
accumulate an id->path mapping as we go, which will tend to match what we
looked for.

I plan to implement this asap, so please speak up now to alter/tweak the
design - and once we stabilise on this, I'll update the wiki page for
it.

The rationale for all this is that we want fast operations for the
common case (diff/status/commit/merge on all files) and extremely fast
operations for the less common but still occurs a lot status/diff/commit
on specific files). Operations on specific files involve a scan for all
the children of a path, *in every involved tree*, which the current
format did not accommodate. 
----

Design priorities:
 1) Fast end to end use for bzr's top 5 uses cases. (commmit/diff/status/merge/???)
 2) fall back current object model as needed.
 3) scale usably to the largest trees known today - say 50K entries. (mozilla
    is an example of this)


Locking:
 Eventually reuse dirstate objects across locks IFF the dirstate file has not
 been modified, but will require that we flush/ignore cached stat-hit data
 because we won't want to restat all files on disk just because a lock was
 acquired, yet we cannot trust the data after the previous lock was released.

Memory representation:
 vector of all directories, and vector of the childen ?
   i.e. 
     root_entrie = (direntry for root, [parent_direntries_for_root]), 
     dirblocks = [
     ('', ['data for achild', 'data for bchild', 'data for cchild'])
     ('dir', ['achild', 'cchild', 'echild'])
     ]
    - single bisect to find N subtrees from a path spec
    - in-order for serialisation - this is 'dirblock' grouping.
    - insertion of a file '/a' affects only the '/' child-vector, that is, to
      insert 10K elements from scratch does not generates O(N^2) memoves of a
      single vector, rather each individual, which tends to be limited to a 
      manageable number. Will scale badly on trees with 10K entries in a 
      single directory. compare with Inventory.InventoryDirectory which has
      a dictionary for the children. No bisect capability, can only probe for
      exact matches, or grab all elements and sort.
    - What's the risk of error here? Once we have the base format being processed
      we should have a net win regardless of optimality. So we are going to 
      go with what seems reasonable.
open questions:

Maybe we should do a test profile of the core structure - 10K simulated
searches/lookups/etc?

Objects for each row?
The lifetime of Dirstate objects is current per lock, but see above for
possible extensions. The lifetime of a row from a dirstate is expected to be
very short in the optimistic case: which we are optimising for. For instance,
subtree status will determine from analysis of the disk data what rows need to
be examined at all, and will be able to determine from a single row whether
that file has altered or not, so we are aiming to process tens of thousands of
entries each second within the dirstate context, before exposing anything to
the larger codebase. This suggests we want the time for a single file
comparison to be < 0.1 milliseconds. That would give us 10000 paths per second
processed, and to scale to 100 thousand we'll another order of magnitude to do
that. Now, as the lifetime for all unchanged entries is the time to parse, stat
the file on disk, and then immediately discard, the overhead of object creation
becomes a significant cost.

Figures: Creating a tuple from from 3 elements was profiled at 0.0625
microseconds, whereas creating a object which is subclassed from tuple was
0.500 microseconds, and creating an object with 3 elements and slots was 3
microseconds long. 0.1 milliseconds is 100 microseconds, and ideally we'll get
down to 10 microseconds for the total processing - having 33% of that be object
creation is a huge overhead. There is a potential cost in using tuples within
each row which is that the conditional code to do comparisons may be slower
than method invocation, but method invocation is known to be slow due to stack
frame creation, so avoiding methods in these tight inner loops in unfortunately
desirable. We can consider a pyrex version of this with objects in future if
desired.
Line # Kind Name Docs
234 Function pack_stat 0 Convert stat values into a packed representation.
243 Function pack_stat Convert stat values into a packed representation.
265 Class DirState Record directory and metadata state for fast access.
2762 Function py_update_entry Update the entry based on what is actually on disk.
2846 Class ProcessEntryPython No class docstring; 2/4 methods documented
def pack_stat 0(st, _encode=binascii.b2a_base64, _pack=struct.pack):
Convert stat values into a packed representation.
def pack_stat(st, _encode=binascii.b2a_base64, _pack=_compiled_pack):
Convert stat values into a packed representation.
def py_update_entry(state, entry, abspath, stat_value, _stat_to_minikind=DirState._stat_to_minikind, _pack_stat=pack_stat):
Update the entry based on what is actually on disk.

This function only calculates the sha if it needs to - if the entry is uncachable, or clearly different to the first parent's entry, no sha is calculated, and None is returned.

ParametersstateThe dirstate this entry is in.
entryThis is the dirblock entry for the file in question.
abspathThe path on disk for this file.
stat_valueThe stat value done on the path.
ReturnsNone, or The sha1 hexdigest of the file (40 bytes) or link target of a symlink.
API Documentation for BzrLib, generated by pydoctor at 2008-12-03 00:00:12.