Git: What does squash do? and Why cherry-pick an older commit?

Question

The latest commit should be the one that incrementally contains all changes of the previous commits, right?

If so, what is --squash for? Is it just to compress all commits into one, to clean up history?

Does squash simply delete all commits except the latest one?

The same goes for cherry-picking too. If the latest commit contains all changes of the previous commits then why would you need to pick up an older commit?

Answer 1

I'm not sure whether you're working under a misconception here. A lot of older source code management systems store each commit (or check-in or whatever they call it) as a "change". Git doesn't: a commit has each file fully intact, ¹ stored as a repository object of type "blob".

The "blob" objects sit at the bottom level, as it were, pointed-to by "tree" objects (which hold the names, and executable-bits, of files/directories), and the trees are pointed-to by "commit" objects. (There's one last repository-level object type, the "annotated tag", which normally points to a commit.) So given a commit SHA-1 and a repository path like dir/file , git starts by extracting the commit object, which leads to a tree that needs to have an entry for dir . That entry needs to lead to another tree, which must then have an entry for file , and then that entry should be a blob, and that's the version of dir/file that appears in that commit.

Branch and tag names, like master , are just human-readable words giving the "true name" SHA-1 of the underlying repository object. Commit objects have "parent SHA-1" values in them, allowing git to extract a commit-graph dynamically.
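To make the object model a bit more concrete, here is a minimal sketch that builds a throwaway repository and walks from a branch name to a commit, to its trees, and finally to the blob for dir/file. The temporary directory, the identity settings and the file contents are my own assumptions for illustration; only the git commands themselves are standard.

```shell
set -e
repo=$(mktemp -d)                         # throwaway repository; location is arbitrary
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is named master
git config user.name "Example"            # placeholder identity for the demo
git config user.email "example@example.com"

mkdir dir
echo "hello" > dir/file
git add dir/file
git commit -q -m "add dir/file"

# A branch name is just a readable name for a commit SHA-1.
git rev-parse master

# Walk the chain: commit -> top-level tree -> tree for dir -> blob for dir/file.
commit_type=$(git cat-file -t HEAD)
root_tree_type=$(git cat-file -t 'HEAD^{tree}')
dir_type=$(git cat-file -t HEAD:dir)
file_type=$(git cat-file -t HEAD:dir/file)
content=$(git cat-file -p HEAD:dir/file)

git cat-file -p HEAD            # shows the tree line, author, committer, message
git cat-file -p 'HEAD^{tree}'   # lists the entry for dir
git cat-file -p HEAD:dir        # lists the entry for file
echo "$commit_type -> $root_tree_type -> $dir_type -> $file_type: $content"
```

The blob holds the whole file, not a change against some earlier version.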

You can, of course, still get change-sets out of git. It just computes them dynamically, every time.

Suppose we have this graph of commits, where I use one uppercase letter to represent each 40-character SHA-1, and some branch-names:

A - B - C - D      <-- master
      \
        E - F      <-- branch

The name master just names commit D (which might really be dcfaa9d9767a010c143ffd42b01b84d2abb4cffc ). That commit has one parent, C (really 222c4dd... ). C has one parent B , B has one parent A , and A has no parents at all—it's a root commit.

Commit F cannot be reached by starting at commit D and working backwards through any parents; it is only reachable by its branch-name branch . F 's parent is E , and E 's (single) parent commit is B , though, so starting from F we can work backwards through B to A .
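If you want to reproduce this graph yourself, something like the following sketch should do it. The commit helper, the one-file-per-commit layout and the temporary directory are assumptions of mine, chosen only so that the commits never conflict.

```shell
set -e
repo=$(mktemp -d)                         # throwaway repository; location is arbitrary
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is named master
git config user.name "Example"
git config user.email "example@example.com"

# Hypothetical helper: make a commit named $1 that adds the file $1.txt
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit A; commit B
git branch branch            # the new branch forks off at B
commit C; commit D           # master is now A - B - C - D
git checkout -q branch
commit E; commit F           # branch is now A - B - E - F
git checkout -q master

git log --graph --oneline --all

# F is not reachable from master; the two lines of history meet at B.
if git merge-base --is-ancestor branch master; then
  f_on_master=yes
else
  f_on_master=no
fi
fork=$(git log -1 --format=%s "$(git merge-base master branch)")
echo "F reachable from master: $f_on_master; fork point: $fork"
```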

This is where regular merges come in: they operate on the commit graph. If we "merge branch into master "—let's do this temporarily, then back it out again:

$ git checkout master
Switched to branch 'master'
$ git merge -m regular-merge branch
[snip]

—we make one new merge commit M :

A - B - C - D - M  <-- master
      \       /
        E - F      <-- branch

This (real, non-"squash") merge has two parents, D and F . (And, importantly, D is the "first" parent: this tells you which commit was originally "on master " before the merge.) So now the two commits that used to be reachable only via branch (starting at F and working backwards), can be found on master too.

What about the various file contents? Well, that's up to whoever does the merge. Using git's automatic merging, if you "merge branch into master " and there are no conflicting changes, you'll get what you'd expect. However, you could merge with -s ours to discard the contents of the changes in E and F . You'll still get a merge commit M , but its tree will be identical to the tree in commit D . ²
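Here is a small sketch of the -s ours case, assuming the same A through F graph built in a scratch repository (the commit helper and file layout are hypothetical). It checks that the merge commit has D as first parent and F as second parent, and that its tree is the same object as D's tree.

```shell
set -e
repo=$(mktemp -d)                         # throwaway repository; location is arbitrary
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.com"

# Hypothetical helper: make a commit named $1 that adds the file $1.txt
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit A; commit B
git branch branch
commit C; commit D
git checkout -q branch
commit E; commit F
git checkout -q master

d=$(git rev-parse master)
f=$(git rev-parse branch)

# Merge, but keep our side's contents and discard the changes from E and F.
git merge -q -s ours -m ours-merge branch

first_parent=$(git rev-parse 'master^1')
second_parent=$(git rev-parse 'master^2')
merge_tree=$(git rev-parse 'master^{tree}')
d_tree=$(git rev-parse "$d^{tree}")

echo "parents: $first_parent $second_parent"
echo "trees:   $merge_tree $d_tree"
```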

At any time you can also ask git to produce the change-set from one commit to another. So if you want to see what changed between B and E , you could find the SHA-1 for both and do:

git diff <sha1-for-B> <sha1-for-E>

To see what changed between E and F , you can simply use their SHA-1s, and to see "what happened on branch branch ", use the SHA-1s for B and F .

As a convenience, to see what changed between some commit and its parent (let's not worry about merges here, since they have multiple parents), we can just use, e.g.:

git show <sha1-for-F>

The git show command will (among other things) find F 's parent and run a diff between E and F , to show us what changed there.

Instead of writing out the full (or partial) SHA-1, we can just use the branch name:

git show branch

In general, if a raw SHA-1 will work, so will a branch name (but not always vice versa).
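The diff and show commands above can be tried on a scratch copy of the graph. This is only a sketch: the commit helper, the file names and the B, E and F shell variables are assumptions I'm introducing to stand in for the real SHA-1s.

```shell
set -e
repo=$(mktemp -d)                         # throwaway repository; location is arbitrary
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.com"

# Hypothetical helper: make a commit named $1 that adds the file $1.txt
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit A
commit B; B=$(git rev-parse HEAD)         # remember the SHA-1 for B
git branch branch
commit C; commit D
git checkout -q branch
commit E; E=$(git rev-parse HEAD)
commit F; F=$(git rev-parse HEAD)
git checkout -q master

git diff "$B" "$E"        # what changed between B and E
git diff "$E" "$F"        # what changed between E and F
git diff "$B" "$F"        # "what happened on branch branch"
git show "$F"             # F's log message plus the diff from E to F
git show branch           # same thing, naming F by its branch

on_branch=$(git diff --name-only "$B" "$F")
tip=$(git rev-parse branch)
tip_parent=$(git rev-parse 'branch^')
```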

Naturally, since the SHA-1s are unwieldy, there are lots more ways to name these things; but let's just ignore that for now and finally get to "squash merges" and "cherry pick".

Let's "un-do" the merge into master so that master and branch are separate once again: ³

$ git status   # just checking! "git reset --hard" could lose work
# On branch master
nothing to commit, working directory clean
$ git reset --hard master^
[snip]

We're now back to this commit graph:

A - B - C - D      <-- master
      \
        E - F      <-- branch

Let's say that change F fixes a nasty bug and we want a copy of it in master . We tried to take it in by using git merge branch , but that brought in change E too, which is not ready yet.

So, now we just "cherry pick" commit F :

$ git cherry-pick branch
[snip]

As usual we can use the branch name to identify the (single) commit at the tip of the branch.

This tells git to gather up the changes between E and F , just like git show would. Instead of showing them to us, though, git tries to patch those changes into the current ( HEAD ) commit. Since we're on branch master, the HEAD commit is commit D . So this extracts the changes from E to F , applies them to D , and if successful, makes a new commit. Let's call it P (for Pick):

A - B - C - D - P  <-- master
      \
        E - F      <-- branch

Here the contents of P may be quite different from the contents of F , but the change (from D to P ) is the same as the change from E to F . The diff output of git show master and git show branch should be very similar—the line numbers might change a bit (or even a lot) but the changes shown should be the same.
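One way to check the "same change, different contents" claim is git patch-id, which hashes a diff while ignoring line numbers and whitespace. The sketch below assumes the same scratch graph as before (the commit helper and file names are hypothetical) and compares the patch IDs of F and P.

```shell
set -e
repo=$(mktemp -d)                         # throwaway repository; location is arbitrary
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.com"

# Hypothetical helper: make a commit named $1 that adds the file $1.txt
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit A; commit B
git branch branch
commit C; commit D
git checkout -q branch
commit E; commit F
git checkout -q master

git cherry-pick branch >/dev/null         # copies the E-to-F change onto D, making P

# Same change: the patch IDs of F and P match.
orig_patch=$(git show branch | git patch-id | cut -d' ' -f1)
pick_patch=$(git show master | git patch-id | cut -d' ' -f1)

# Different contents: P has C and D but not E, so the trees differ.
p_tree=$(git rev-parse 'master^{tree}')
f_tree=$(git rev-parse 'branch^{tree}')
p_parent=$(git log -1 --format=%s 'master^')

echo "patch ids: $orig_patch $pick_patch"
echo "P's parent is $p_parent"
```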

Let's toss out P the same way we tossed out the merge M earlier. ⁴ Note that we're still on branch master here, and it's still clean (nothing going on, nothing to commit, even though I'm not bothering with git status this time):

$ git reset --hard master^
[snip]

This time, let's do a "squash-merge" of branch .

The action of this git merge is very similar to a regular merge, except that instead of merge-commit M with two parents, it will set up a "squash merge" commit, let's call it S , with only one parent. It doesn't actually do the commit (squash implies --no-commit ) so we have to do the commit part explicitly:

$ git merge --squash branch
Squash commit -- not updating HEAD
Automatic merge went well; stopped before committing as requested
$ git commit -m squash-merge
[snip]

Now we have this:

A - B - C - D - S  <-- master
      \
        E - F      <-- branch

The tree for commit S —the set of all files—will be the same as the tree you'd get with a regular merge. In this case, that would be the equivalent of applying, as a single patch, the git diff between B and F , to the tree-contents of commit D . ⁵

In other words, S has "the changes between B and E plus the changes between E and F ", applied to D . But it has only one single parent. It's this commit-graph difference that makes it a "squash merge" rather than a regular merge.
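To see both halves of that claim, the following sketch makes the squash commit S on master and then a regular merge on a separate branch starting from D, and compares them. The branch name scratch, the commit helper and the file layout are assumptions for the demo.

```shell
set -e
repo=$(mktemp -d)                         # throwaway repository; location is arbitrary
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.com"

# Hypothetical helper: make a commit named $1 that adds the file $1.txt
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit A; commit B
git branch branch
commit C; commit D; D=$(git rev-parse HEAD)
git checkout -q branch
commit E; commit F
git checkout -q master

# Squash merge: stages the merged result, then we commit it ourselves as S.
git merge --squash branch >/dev/null
git commit -q -m squash-merge
squash_tree=$(git rev-parse 'HEAD^{tree}')
squash_parents=$(git log -1 --format=%P HEAD | wc -w | tr -d ' ')
if git merge-base --is-ancestor branch master; then linked=yes; else linked=no; fi

# Regular merge from the same starting point D, on a scratch branch.
git checkout -q -b scratch "$D"
git merge -q -m regular-merge branch
merge_tree=$(git rev-parse 'HEAD^{tree}')
merge_parents=$(git log -1 --format=%P HEAD | wc -w | tr -d ' ')

echo "squash: $squash_parents parent(s), merge: $merge_parents parent(s)"
```

The trees match, but only the regular merge records F as a parent, so only there does git know that branch was merged.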

Of course, if the regular merge was unsuitable because commit E was not ready for master , the squash merge is unsuitable too, since it brings in E 's changes as well. So here cherry-pick is the sensible option.

Important aside: note how each time we did something to branch master that added a new commit (the merge M , the cherry pick P , or the squash-commit S ), the branch master automatically "moved forward" to point to the newest commit. That's what distinguishes a branch (or a "local branch") from other labels, in git. A branch name is just a label for a commit ID, one that automatically moves as you add new commits. Tag names work exactly like branch names except that they don't move.
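A tiny sketch of that difference, assuming a scratch repository and a lightweight tag that I'm calling v1 purely for illustration:

```shell
set -e
repo=$(mktemp -d)                         # throwaway repository; location is arbitrary
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.com"

# Hypothetical helper: make a commit named $1 that adds the file $1.txt
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit A
git tag v1                                # lightweight tag on commit A (name is arbitrary)
before=$(git rev-parse master)

commit B                                  # adding a commit moves master, not the tag
after=$(git rev-parse master)
tag_target=$(git rev-parse v1)

echo "master: $before -> $after"
echo "v1:     $tag_target"
```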

¹ Well, blobs have them intact, but compressed (with deflate compression), as long as they are "loose objects". Eventually loose objects are "packed" to save even more space, and packs can then be "delta-compressed", giving all the space savings available in the older delta-based SCMs, and actually more, because any one object can be compressed against any other object, at least in theory. File foo does not have to be compressed only against "previous version of file foo".

² This is mostly meant as a way to document the "killing off" of a branch, as an alternative to simply abandoning it.

³ The reset --hard does two things: it resets the working directory (and the index) back to the state it had at commit D , and it changes the branch label master to point back to commit D again. The simple ^ suffix here tells git to follow the "first parent". The other main syntax, "going back N commits" (e.g., master~3 goes back 3), also follows "first parents", so from merge commit M , master~3 would count back to D , then C , then B . Once the reset has taken effect, though, master names commit D again, so going back 3 goes to C , then B , then A instead.
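The counting in this footnote can be checked with a sketch like the one below, which resolves each name to its commit subject before and after the reset. The commit and subject helpers, and the scratch repository, are hypothetical conveniences.

```shell
set -e
repo=$(mktemp -d)                         # throwaway repository; location is arbitrary
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.com"

# Hypothetical helpers: make a commit named $1; print the subject of commit $1
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }
subject() { git log -1 --format=%s "$1"; }

commit A; commit B
git branch branch
commit C; commit D
git checkout -q branch
commit E; commit F
git checkout -q master
git merge -q -m regular-merge branch      # master now names merge commit M

first=$(subject 'master^')                # first parent of M
second=$(subject 'master^2')              # second parent of M
back3=$(subject 'master~3')               # three first-parent steps back from M

git reset -q --hard 'master^'             # master names D again

tip_after=$(subject master)
back3_after=$(subject 'master~3')         # three steps back from D

echo "before reset: ^=$first ^2=$second ~3=$back3"
echo "after reset:  tip=$tip_after ~3=$back3_after"
```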

⁴ Incidentally, you might—in fact, you should —wonder: what happens to these commits we're casually "tossing out"? The answer is: they live on in the repo, labeled through the "reflog", until the reflog entries time out. By default, a reflog entry "expires" after 90 days if it's "reachable"—defining this gets a bit too technical for this footnote that's already too long—and 30 days if it's "unreachable". These are "unreachable", so they expire in about a month. After that, the commits, and any trees and blobs that only these tossed-out commits use, are garbage-collected on the next git gc .
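As a sketch of how a tossed-out commit can still be found through the reflog before it expires, the following repeats the merge and reset in a scratch repository and then gives the old merge commit a new name. The branch name rescued and the commit helper are assumptions for the demo.

```shell
set -e
repo=$(mktemp -d)                         # throwaway repository; location is arbitrary
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.com"

# Hypothetical helper: make a commit named $1 that adds the file $1.txt
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit A; commit B
git branch branch
commit C; commit D
git checkout -q branch
commit E; commit F
git checkout -q master
git merge -q -m regular-merge branch
m=$(git rev-parse master)                 # SHA-1 of merge commit M

git reset -q --hard 'master^'             # "toss out" M

git reflog show master                    # M is still listed here
tossed=$(git rev-parse 'master@{1}')      # where master pointed one move ago
git branch rescued "$tossed"              # give M a name again
rescued_subject=$(git log -1 --format=%s rescued)

echo "recovered $tossed ($rescued_subject)"
```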

⁵ This assumes you don't do something weird like apply the ours strategy to the squash-merge, but that would be useless. (Also, unlike the not-part-of-git patch command, git is smart when doing merges and cherry-picks and such, in that it can usually tell if you already have some particular patch applied, and not try to apply it twice.)

Answer 2

Yes, squashing will collapse commit history. You might do this if, for example, you have been working on a feature on a separate branch, and you don't want to pollute a mainline branch with hundreds of commits. On the other hand, you'd probably only want to squash a branch you're closing, since you lose the tracking benefit of a non-squashed merge. (With a "normal" merge you can look at the commit graph and see which commit the merge came from; not so with a "squashed" merge.)

Usually you cherry-pick changes from other branches; your current branch won't contain a commit you want to cherry-pick.

solution1
12 ACCEPTED 2013-12-10 09:23:11

solution2
2 2013-12-10 07:40:33
