
How does git track file changes internally?

Could somebody explain how git knows internally that files X, Y and Z have changed? What is the process behind the scenes that recognizes when a file has not yet been added or has modifications? I am asking because, with Subversion it's simple to figure out that it keeps track of these things by having a .svn directory under each folder, but for git I can't seem to find a description of the inner workings of this. I doubt it scans through all the sub-directories for changes, as it's quite fast.

So, out of curiosity, what are its inner workings?

The mechanism by which one determines the status of a file is fairly straightforward. To know which files have been staged, one simply diffs the HEAD tree against the index. Any items that appear only in the index have been staged for addition, any items that appear only in HEAD have been staged for removal, and any items that differ between the two have had changes staged.
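
As a rough illustration of that comparison, here is a minimal Python sketch. It assumes both the HEAD tree and the index can be flattened into a mapping from path to blob ID. The function name `staged_changes` and the short fake IDs are made up for the example and are not Git's actual data structures.

```python
def staged_changes(head_tree, index):
    """Classify paths by comparing two {path: blob_id} mappings."""
    # Only in the index: staged for addition.
    added = sorted(p for p in index if p not in head_tree)
    # Only in HEAD: staged for removal.
    removed = sorted(p for p in head_tree if p not in index)
    # In both, but pointing at different blobs: staged modification.
    modified = sorted(
        p for p in index if p in head_tree and index[p] != head_tree[p]
    )
    return {"added": added, "removed": removed, "modified": modified}


# Hypothetical snapshots; real blob IDs are 40 hex characters.
head_tree = {"a.txt": "1111", "b.txt": "2222", "c.txt": "3333"}
index = {"a.txt": "1111", "b.txt": "9999", "d.txt": "4444"}

result = staged_changes(head_tree, index)
print(result)
```

The same function applied to the index and a snapshot of the working directory would give the unstaged changes.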

Similarly, one would detect unstaged changes by diff'ing the index with the working directory.

Your question in particular asks how this can be so fast (after all, computing the SHA1 hash of a file is not exactly speedy). This is where the index, also known as the cache, comes into play again. Each index entry also has fields for the file size and the file modification time. Thus one can simply stat(2) a file on disk and compare the result against the size and modification time recorded in the index to decide whether the file needs to be hashed at all.
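
To make the shortcut concrete, here is a small Python sketch of the idea. It is not Git's code: the entry layout (`size`, `mtime_ns`, `blob`) and the helper names are assumptions for illustration, and the real index records more stat fields than these two.

```python
import hashlib
import os
import tempfile


def blob_id(path):
    """Hash a file the way Git hashes a blob: header plus contents."""
    with open(path, "rb") as f:
        data = f.read()
    return hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()


def make_entry(path):
    """Record the stat data and content hash, like a simplified index entry."""
    st = os.stat(path)
    return {"size": st.st_size, "mtime_ns": st.st_mtime_ns, "blob": blob_id(path)}


def is_modified(path, entry):
    st = os.stat(path)
    if st.st_size == entry["size"] and st.st_mtime_ns == entry["mtime_ns"]:
        return False  # stat data matches: assume unchanged, skip the hash
    # Stat data differs, so pay for the hash to find out for sure.
    return blob_id(path) != entry["blob"]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.txt")
    with open(path, "wb") as f:
        f.write(b"version 1\n")
    entry = make_entry(path)
    unchanged = is_modified(path, entry)

    with open(path, "wb") as f:
        f.write(b"a longer version 2\n")  # different size, so stat differs
    changed = is_modified(path, entry)

print(unchanged, changed)
```

The point of the design is that the common case (nothing changed) costs one stat call per file and no reads of file contents.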

You can find your answer in the free book Pro Git, in the chapter Git Internals.

This chapter explains how Git works under the hood.

As Leo stated, Git checks the SHA1 of a file to see whether it has changed. You can check this yourself like so (taken from Git Internals):

$ echo 'version 1' > test.txt
$ git hash-object -w test.txt
83baae61804e65cc73a7201a7252750c76066a30

Then, write some new content to the file, and save it again:

$ echo 'version 2' > test.txt
$ git hash-object -w test.txt
1f7a7a472abf3dd9643fd615f6da379c4acb3e3a
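
If you want to see where those IDs come from, the following Python sketch computes them without Git. It relies on the blob format described in Git Internals (the word `blob`, the content length, a NUL byte, then the content) and assumes `echo` wrote a trailing newline. The function name `git_blob_id` is just a label for this example.

```python
import hashlib


def git_blob_id(data: bytes) -> str:
    """SHA-1 over 'blob <size>\\0' followed by the raw file contents."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


# echo 'version 1' writes the text plus a newline.
v1 = git_blob_id(b"version 1\n")
v2 = git_blob_id(b"version 2\n")
print(v1)
print(v2)
```

Under those assumptions the two printed values should match the IDs reported by `git hash-object` above.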

If the answer in the possible duplicate doesn't suffice, you might want to take a look at this: http://www.geekgumbo.com/2011/07/19/git-basics-how-git-saves-your-work/

To make a long story short, Git uses the SHA-1 of the file contents to keep track of changes. Git stores four types of object: a blob, a tree, a commit, and a tag.

To answer your question on how it keeps track of changes here's a quote from that link:

The tree object is how Git keeps track of file names and directories. There is a tree object for each directory. The tree object points to the SHA-1 blobs, the files, in that directory, and other trees, sub-directories at the time of the commit. Each tree object is encrypted into, you guessed it, a SHA-1 hash of its contents, and stored in .git/objects. The name of the trees, since they are SHA-1 hashes, allow Git to quickly see if there's been any changes to any files or directories by comparing the name to the previous name. Pretty slick.
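
The property described in that quote can be sketched in Python as follows. This is a simplified model, not Git's implementation: directories are plain dicts, every file gets mode `100644`, and entries are sorted by plain name (real Git has extra ordering rules for directories), so the IDs are not guaranteed to match what Git would produce for every layout.

```python
import hashlib


def blob_id(data):
    return hashlib.sha1(b"blob %d\0" % len(data) + data).digest()


def tree_id(directory):
    """directory maps a name to bytes (a file) or to a dict (a sub-directory)."""
    body = b""
    for name in sorted(directory):
        node = directory[name]
        if isinstance(node, dict):
            mode, child = b"40000", tree_id(node)   # sub-tree
        else:
            mode, child = b"100644", blob_id(node)  # regular file
        body += mode + b" " + name.encode() + b"\0" + child
    return hashlib.sha1(b"tree %d\0" % len(body) + body).digest()


before = {
    "README": b"hello\n",
    "src": {"main.c": b"int main(void) { return 0; }\n"},
    "docs": {"guide.txt": b"read me\n"},
}
after = {
    "README": b"hello\n",
    "src": {"main.c": b"int main(void) { return 1; }\n"},  # one file edited
    "docs": {"guide.txt": b"read me\n"},
}

root_changed = tree_id(before) != tree_id(after)
src_changed = tree_id(before["src"]) != tree_id(after["src"])
docs_changed = tree_id(before["docs"]) != tree_id(after["docs"])
print(root_changed, src_changed, docs_changed)
```

Because a tree's ID depends on the IDs of its children, an edit deep in `src` changes the root ID, while `docs` keeps its ID and can be skipped entirely when comparing.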

I found the following explanation helpful in a recent course I followed on Lynda.com, Git Essential Training by Kevin Skoglund.

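Git generates a hash key of 40 hexadecimal characters by running the SHA-1 algorithm over the content it stores. Identical content always produces the same hash key, so the same file contents saved on different occasions get the same blob ID. A commit's ID, however, also covers the commit's metadata (parent, author, timestamp, message), so committing the same set of changes on two different occasions generally produces two different commit IDs.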

Additionally, it keeps track of previous changes by keeping the following metadata with each commit:

  1. Current Commit ID
  2. Parent Commit ID
  3. Author
  4. Message

Each subsequent commit refers to a parent commit, while the first commit has no parent (or a null/nil value). The following diagram would be helpful in this regard.

Image credit : Git Essential Training by Kevin Skoglund at Lynda.com
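
For readers who prefer code to a picture, here is a hedged Python sketch of how a commit ID can depend on its parent and metadata. It follows the general shape of a Git commit object (tree, optional parent, author, committer, message), but the tree ID, the author string, the fixed `+0000` timezone and the timestamps are all made-up values for the example.

```python
import hashlib


def commit_id(tree, parent, author, timestamp, message):
    """Hash a simplified commit object; parent is None for the first commit."""
    lines = ["tree " + tree]
    if parent is not None:
        lines.append("parent " + parent)
    ident = f"{author} {timestamp} +0000"
    lines.append("author " + ident)
    lines.append("committer " + ident)
    body = ("\n".join(lines) + "\n\n" + message + "\n").encode()
    return hashlib.sha1(b"commit %d\0" % len(body) + body).hexdigest()


tree = "a" * 40  # placeholder tree ID, not a real object
who = "A U Thor <author@example.com>"

first = commit_id(tree, None, who, 1700000000, "first")
second = commit_id(tree, first, who, 1700000060, "second")  # points at first

same_again = commit_id(tree, None, who, 1700000000, "first")
later = commit_id(tree, None, who, 1700009999, "first")  # only the time differs

print(first == same_again, first == later)
```

Since the parent's ID is part of the hashed text, each commit ID indirectly covers the whole history before it.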

I found this article very helpful.

https://codewords.recurse.com/issues/two/git-from-the-inside-out

Git is built on a graph. Almost every Git command manipulates this graph. To understand Git deeply, focus on the properties of this graph, not workflows or commands.

Extract - Make a commit that is not the first commit

The user sets the content of data/number.txt to 2. This updates the working copy, but leaves the index and HEAD commit as they are.

The user adds the file to Git. This adds a blob containing 2 to the objects directory. It points the index entry for data/number.txt at the new blob.
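
Those two steps can be mimicked with a toy in-memory model in Python. The three dicts below stand in for the working copy, the index and the objects directory. The starting content `1` and the `add` helper are assumptions for the sketch, not something taken from the article.

```python
import hashlib


def blob_id(data):
    return hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()


working_copy = {}  # path -> current file contents
index = {}         # path -> blob ID staged for the next commit
objects = {}       # blob ID -> stored contents


def add(path):
    """Toy 'git add': store a blob and point the index entry at it."""
    data = working_copy[path]
    oid = blob_id(data)
    objects[oid] = data
    index[path] = oid
    return oid


working_copy["data/number.txt"] = b"1"  # assumed earlier state
first = add("data/number.txt")

# Step 1: editing the file touches only the working copy.
working_copy["data/number.txt"] = b"2"
stale = index["data/number.txt"]

# Step 2: adding writes a new blob and repoints the index entry.
second = add("data/number.txt")

print(stale == first, index["data/number.txt"] == second)
```

Note that the old blob stays in `objects` after the second add; nothing is overwritten, the index entry simply points somewhere new.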
