
How does git recreate deltified objects from packfiles?

How does git parse a packfile? One crucial step that I can't find any information on is how git handles deltified objects in the packfile. One good resource I found was the git documentation, which I can no longer find, but here is something like a copy of it. I understand that git compresses the various data it keeps in the objects directory, such as commits, trees, blobs, and the like, and that a packfile can also hold delta objects, in which some base data is somehow combined with delta data. Here is the list of object types: OBJ_COMMIT, OBJ_TREE, OBJ_BLOB, OBJ_TAG, OBJ_OFS_DELTA, OBJ_REF_DELTA. I also understand that there are two types of operations, namely copy and insert, but it is not clear to me how these operations reconstruct a modified file from the packfile.

I also found a fairly good guide here.

Say that I have an OBJ_REF_DELTA object sitting somewhere in the packfile. In that packfile I will then be able to parse the base object from the first 20 bytes (and maybe find it in the packfile through offsets stored in the index file or something). Then comes the zlib-compressed data of the delta. What does that decompressed data look like? Are those the copy snippets or the insert snippets, or both? It says:

The delta begins with the source and target lengths, both encoded as variable-length integers, which is useful for error checking, but is not essential. After this, there are a series of instructions, which may be either “copy” (MSB = 1) or “insert” (MSB = 0).

By "The delta begins", do they mean the compressed data after the ref of the base object? No, because they say that the copy instructions are used to copy from the base object:

Copy instructions signal that we should copy a consecutive chunk of bytes from the base object to the output. There are two numbers that are necessary to perform this operation: the location (offset) of the first byte to copy, and the number of bytes to copy. These are stored as little-endian variable-length integers after each copy instruction; however, their contents are compressed.

So where do the copy and insert instructions actually sit? And why is there no delete option? I learned that a delta is not a diff, so maybe it means that the additions I made to the file are not the delta, but that the delta is computed from the two files that are stored as blobs or something, and that the delta can only have copy and insert and no delete. Is that correct?

The technical documents are in the Git repository, under Documentation/technical.

Say, that I have an OBJ_REF_DELTA object sitting somewhere in the packfile. In that packfile I will then be able to parse the base object from the first 20 bytes (and maybe find it in the packfile through offsets stored in the index file or something).

Yes; or, if this is an OBJ_OFS_DELTA, there will be a negative relative position stored here instead.

For any normal pack file, the base object of an OBJ_REF_DELTA must be in the same pack file: pack files may not refer to objects outside themselves. A thin pack, however, violates this rule, and its OBJ_REF_DELTA entries can refer to an object that is not in the pack, in which case you must find the object in some other pack file or as a loose object.
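As a rough illustration of how the base reference is read, here is a minimal Python sketch of parsing one pack entry header. The function name `parse_entry_header` and the `hash_len` parameter are my own choices for illustration (20 assumes a SHA-1 repository); the bit layout follows my reading of the pack format document, so treat it as a sketch rather than a reference implementation.

```python
OBJ_NAMES = {1: "OBJ_COMMIT", 2: "OBJ_TREE", 3: "OBJ_BLOB", 4: "OBJ_TAG",
             6: "OBJ_OFS_DELTA", 7: "OBJ_REF_DELTA"}

def parse_entry_header(pack: bytes, entry_pos: int, hash_len: int = 20):
    """Return (type name, inflated size, base reference, position of zlib data)."""
    pos = entry_pos
    byte = pack[pos]; pos += 1
    obj_type = (byte >> 4) & 0x07      # bits 6..4 hold the object type
    size = byte & 0x0F                 # bits 3..0 hold the low bits of the size
    shift = 4
    while byte & 0x80:                 # MSB set: another size byte follows
        byte = pack[pos]; pos += 1
        size |= (byte & 0x7F) << shift
        shift += 7

    base = None
    if obj_type == 7:                  # OBJ_REF_DELTA: raw object name of the base
        base = pack[pos:pos + hash_len]
        pos += hash_len
    elif obj_type == 6:                # OBJ_OFS_DELTA: distance back to the base
        byte = pack[pos]; pos += 1
        distance = byte & 0x7F
        while byte & 0x80:
            byte = pack[pos]; pos += 1
            distance = ((distance + 1) << 7) | (byte & 0x7F)
        base = entry_pos - distance    # absolute offset of the base entry
    return OBJ_NAMES[obj_type], size, base, pos
```

For an OBJ_REF_DELTA the returned base is an object name to look up (in the index, another pack, or as a loose object); for an OBJ_OFS_DELTA it is a position earlier in the same pack. The zlib-compressed delta starts at the returned position.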

(Multi-pack-index files, if present, give you another way to find the packfile(s) that contain some object(s).)

Then comes the zlib-compressed data of the delta. What does that decompressed data look like? Are those the copy snippets or the insert snippets, or both?

Both of course:

It says:

The delta begins with the source and target lengths, both encoded as variable-length integers, which is useful for error checking, but is not essential. After this, there are a series of instructions, which may be either “copy” (MSB = 1) or “insert” (MSB = 0).

By "The delta begins", do they mean the compressed data after the ref of the base object? No, because they say that the copy instructions are used to copy from the base object...

Right. We have these unnecessary-except-for-error-checking values, which we can completely ignore if we want. Then we have instructions, which either read:

  • copy N bytes from the delta base object, starting at offset O

(with two variable-length numbers in it), or:

  • insert N bytes

(with one number in it, followed by the N literal bytes to insert). The numbers are encoded cleverly to save space, but ignoring the cleverness, let's just assume we have an N and an O (for a copy) or an N (for insert).
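For readers who do want to see the cleverness, here is a minimal Python sketch that turns a decompressed delta into a list of symbolic instructions. The names `read_varint` and `parse_delta` are hypothetical, and the byte layout (flag bits selecting which offset and size bytes are present, a size of 0 meaning 0x10000) is my understanding of the pack format document rather than something stated above.

```python
def read_varint(buf: bytes, pos: int):
    """Little-endian base-128 integer: 7 payload bits per byte, MSB = 'more'."""
    value = shift = 0
    while True:
        byte = buf[pos]; pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return value, pos

def parse_delta(delta: bytes):
    """Return (source_size, target_size, ops) for a decompressed delta."""
    source_size, pos = read_varint(delta, 0)
    target_size, pos = read_varint(delta, pos)
    ops = []
    while pos < len(delta):
        cmd = delta[pos]; pos += 1
        if cmd & 0x80:                      # copy: MSB = 1
            offset = size = 0
            for bit in range(4):            # bits 0..3: which offset bytes follow
                if cmd & (1 << bit):
                    offset |= delta[pos] << (8 * bit); pos += 1
            for bit in range(3):            # bits 4..6: which size bytes follow
                if cmd & (0x10 << bit):
                    size |= delta[pos] << (8 * bit); pos += 1
            if size == 0:                   # an all-zero size stands for 0x10000
                size = 0x10000
            ops.append(("copy", offset, size))
        elif cmd:                           # insert: cmd itself is N (1..127)
            ops.append(("insert", delta[pos:pos + cmd]))
            pos += cmd
        else:
            raise ValueError("instruction byte 0 is reserved")
    return source_size, target_size, ops
```

Zero bytes of an offset or size are simply omitted, which is where the space saving comes from: a copy from offset 6 needs one offset byte, not four.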

why is there no delete option?

When would you use one? Suppose we'd like to take the first ten bytes at offset 100, then the next 32 bytes at offset 6. The result is 42 bytes long. Which of those 42 bytes would you like to delete? Why? Why didn't we just take 9 bytes or 31 bytes?

You might say: "well, if I could take 32 bytes and delete one in the middle, I would have the 31 bytes I want." But we can encode that as "take 15 bytes at offset 6, then take 16 bytes at offset 22". That's slightly longer, because we had to encode an N and an O for each, rather than one N-and-O and one delete-N only, but at the same time it's also slightly shorter, because if we had a delete we'd need an offset for where to do the delete. So in the end it's just a wash.
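The equivalence is easy to check with plain slicing. In this small Python sketch, `copy` is a hypothetical stand-in for the copy instruction and the 64-byte `base` is made-up data:

```python
def copy(base: bytes, offset: int, n: int) -> bytes:
    """Stand-in for 'copy n bytes from the base, starting at offset'."""
    return base[offset:offset + n]

base = bytes(range(64))                 # made-up base object

# Hypothetical "copy 32 bytes at offset 6, then delete one in the middle".
one_copy = bytearray(copy(base, 6, 32))
del one_copy[15]                        # position 15 of the copy is base byte 21

# The same 31 bytes expressed with two copies and no delete.
two_copies = copy(base, 6, 15) + copy(base, 22, 16)

assert bytes(one_copy) == two_copies
```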

We still need the "insert N bytes" operation, since we might have to insert a byte sequence such as "inconceivable" that occurs nowhere in any existing object. Lacking that, we'd need objects that hold at least one of each possible byte (a single object with all 256 possible bytes in it would suffice), but this would be less efficient and less obvious than our simple "insert N".

So where do the copy and insert instructions actually sit?

The entire object (OBJ_REF_DELTA or OBJ_OFS_DELTA), apart from the base reference and the two lengths, consists of nothing but these instructions. We're to copy some chunk(s) from some existing object, and insert some other chunk(s) before, between, and/or after some or all of these copied chunks, and the result is the final object.
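Reconstruction is then just a loop over the instructions. This is a minimal Python sketch that assumes the instructions have already been decoded into tuples; the tuple shapes `("copy", offset, size)` and `("insert", data)` and the name `apply_ops` are my own convention, not git's.

```python
def apply_ops(base: bytes, ops, expected_size=None) -> bytes:
    """Rebuild the target object from a base object and decoded instructions."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, size = op
            if offset + size > len(base):
                raise ValueError("copy reaches past the end of the base object")
            out += base[offset:offset + size]
        elif op[0] == "insert":
            out += op[1]                 # literal bytes carried in the delta
        else:
            raise ValueError(f"unknown instruction {op[0]!r}")
    # The target length in the delta header is only a sanity check.
    if expected_size is not None and len(out) != expected_size:
        raise ValueError("result does not match the recorded target size")
    return bytes(out)

base = b"the quick brown fox jumps over the lazy dog"
ops = [("copy", 0, 10), ("insert", b"red"), ("copy", 15, 4)]
result = apply_ops(base, ops, expected_size=17)   # b"the quick red fox"
```

If the base is itself a delta, it has to be rebuilt the same way first, so a chain of deltas unwinds from the innermost base outward.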

... so maybe it means that the additions to the file I made are not the delta, but that the delta is computed from the two files that are stored as blobs or something and that the delta can only have copy and insert and no delete. Is that correct?

That is indeed correct. Git starts with "loose-equivalent objects", at least conceptually, and compresses object O_k against object O_j, with each object considered as a single total byte sequence. The compressor tries O_k against as many candidate objects O_0, O_1, O_2, ... as it feels worth trying; whatever gets the "best" result is the final result. Note that O_k is now in the "window" of things that can be used as base objects for the next object O_l, and so on; this is true whether or not O_k wound up being compressed.

(There's a maximum chain length, so at some point objects that themselves have to be built from objects that have to be built from objects that have to be built from, etc., no longer go into the window.)
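To make the window and chain-length ideas concrete, here is a toy Python sketch. It is not git's algorithm: the cost function (count of bytes that `difflib.SequenceMatcher` fails to match) is a crude assumption of mine, `choose_bases` is a hypothetical name, and the `window` and `max_depth` parameters are only loosely modeled on the `--window` and `--depth` options of `git pack-objects`.

```python
from difflib import SequenceMatcher

def estimate_delta_size(base: bytes, target: bytes) -> int:
    """Crude cost: target bytes with no match in base (would need inserting)."""
    blocks = SequenceMatcher(None, base, target, autojunk=False).get_matching_blocks()
    return len(target) - sum(block.size for block in blocks)

def choose_bases(objects, window=10, max_depth=50):
    """Map each object index to the index of its chosen base, or None."""
    chosen, depth = {}, {}
    for k, target in enumerate(objects):
        best, best_cost = None, len(target)      # storing it whole is the fallback
        for j in range(max(0, k - window), k):   # only recent objects are candidates
            if depth[j] >= max_depth:            # chain too long: not usable as a base
                continue
            cost = estimate_delta_size(objects[j], target)
            if cost < best_cost:
                best, best_cost = j, cost
        chosen[k] = best
        depth[k] = 0 if best is None else depth[best] + 1
    return chosen
```

With `max_depth=1`, an object that is already a delta is never picked as a base, so every delta points directly at a whole object.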
