git filter-branch followed by git push caused double commits

Question

I was trying to remove some large binaries from a repo to reduce its cloning size. After researching the topic I stumbled upon the following script:

#!/bin/bash

# this script lists blobs of 1 MiB or more that exist in history but not in HEAD, sorted from smallest to largest
# on macOS, gcut and gnumfmt come from `brew install coreutils`; on Linux use cut and numfmt instead

git rev-list --objects --all \
| git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
| sed -n 's/^blob //p' \
| grep -vF "$(git ls-tree -r HEAD | awk '{print $3}')" \
| awk '$2 >= 2^20' \
| sort --numeric-sort --key=2 \
| gcut -c 1-12,41- \
| gnumfmt --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest

Taken from https://stackoverflow.com/a/42544963/5470921 with some tweaks.

The output is something like:

0d99bb931299   44MiB other/assets.sketch
2ba44098e28f   44MiB other/assets.sketch
bd1741ddce0d   45MiB other/assets.sketch

The next step would be to remove the unwanted files. For that I used the following command:

# to remove a file (displayed path/to/file in the output)
git filter-branch --index-filter 'git rm --cached --ignore-unmatch path/to/file' --tag-name-filter cat HEAD

Taken from https://stackoverflow.com/a/46615578/5470921 .

So far so good. Next I ran the following command foolishly on the master branch without making any backups:

git filter-branch --index-filter 'git rm --cached --ignore-unmatch other/assets.sketch' --tag-name-filter cat HEAD

This created a new commit named "Merge remote-tracking branch 'origin/master'". After that I clicked the Sync button in the GitHub Desktop client, pushing the changes to the repo.

When I ran the first script again, I saw that the files were still there; they had not been removed. After further investigation, I noticed that the repo now has double commits.

I spent a day trying to restore the repo to its old state without any luck. While doing so I also deleted the local repo from my device, which means I no longer have the git reflog history, nor do I have access to something like refs/original/refs/heads/master.

How can I restore the repo to its original state? Is that still possible?

Answer 1

Note: if this is TL;DR, skip down to the last section, "How to fix it" (but it will make more sense if you read the preceding ones).

What you need to understand is that git filter-branch copies commits. That is, it takes each existing commit, applies some filter or set of filters to it, and makes a new commit from the result. That's how you ended up with two sets of commits. This is necessary because it is in no one's power, especially not Git's, to change anything about any existing commit.

The filtered commits are a new history, largely independent of the original history. (Some details depend on the precise filters and commit inputs.) It's worth keeping in mind that a Git repository does not contain files, precisely; it contains commits, and the commits are the history. Each commit contains a snapshot, so in that sense the repository does contain files, but they're one step below the overview, which is on a commit-by-commit basis.

Every commit has a unique hash ID. These are the big long ugly names you see in git log output: commit b7bd9486b055c3f967a870311e704e3bb0654e4f and so on. This unique ID serves for Git to find the commit object, and hence the files; but the hash ID itself is simply a cryptographic checksum of the full contents of the commit. Each commit lists the hash ID of its parent commit (or commits) as well, and the parent hash (and the snapshot hash) is part of the commit's contents. This is why Git can't change anything about a commit: if you take the contents, and change anything , even a single bit, and make a new commit out of that, you get a new, different hash ID, which is a new, different commit.
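As a rough illustration of the checksum point, the sketch below builds a throwaway repository (the temporary directory and the `demo` identity are assumptions made only for this example), feeds the raw text of a commit back through git hash-object, and compares the result with the commit's own ID. Changing a single character of the message yields a different ID.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.name=demo -c user.email=demo@example.com commit -q --allow-empty -m "first"

actual=$(git rev-parse HEAD)

# Hash the unmodified commit text: this reproduces the commit's own ID.
recomputed=$(git cat-file commit HEAD | git hash-object -t commit --stdin)

# Change one character of the message: the resulting ID is different.
changed=$(git cat-file commit HEAD | sed 's/first/First/' | git hash-object -t commit --stdin)

echo "actual:     $actual"
echo "recomputed: $recomputed"
echo "changed:    $changed"
```

Nothing is written to the repository by git hash-object here (no -w), so the "changed" commit exists only as a computed ID.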

Since each commit contains the ID of its parent(s), this means that if we somehow tell Git—by hash ID—which commit is the newest , it can pull that commit out and use it to find the second-newest commit:

...  <--second-newest  <--newest

The second-newest points back to the third-newest, and so on. If the chain is totally linear (if there are no branches and merges), we end up with a very simple picture:

A--B--C--D--E--F--G--H   <-- master

Here, the name master remembers the actual hash ID of the latest commit, which we'll call H instead of coming up with its actual hash ID. Commit H remembers the hash ID of the previous commit G , which remembers the ID of F , and so on. Commit A is the very first commit, so it just has no parent at all, which lets the action stop.

Branching is just a matter of picking out some commit in the chain and creating a child that's not at the tip of master . For instance, suppose we leave master where it is, pointing to H , and make a new commit I on a new branch we call dev :

...--H   <-- master
      \
       I   <-- dev (HEAD)

If we then git checkout master and make a new commit J we get:

...--H--J   <-- master (HEAD)
      \
       I   <-- dev

Note that the act of putting new commits into the repository requires that we have Git change one of the names . We put new commit I in, and made Git change the name dev —which used to point to H along with master —so that dev points to (contains the hash ID of) I . Then we put new commit J in, making Git update master to point to J instead of H .

(The special name HEAD is simply attached to whichever branch-name is the one we want Git to update when we run git commit .)
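The H, I, J pictures above can be reproduced in a scratch repository. This is only a sketch: the temporary directory, the `demo` identity, the `commit` helper function, and the use of empty commits are all assumptions made for the illustration.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is called master

# Hypothetical helper: make an empty commit with the given message.
commit() {
    git -c user.name=demo -c user.email=demo@example.com commit -q --allow-empty -m "$1"
}

commit H
h=$(git rev-parse master)

git checkout -q -b dev                    # dev and master both point to H
commit I                                  # only dev moves, because HEAD is attached to dev
master_after_i=$(git rev-parse master)
dev_tip=$(git rev-parse dev)

git checkout -q master
commit J                                  # now master moves, dev stays at I
master_tip=$(git rev-parse master)

parent_of_i=$(git rev-parse dev^)
parent_of_j=$(git rev-parse master^)

git log --oneline --graph --all
```

Both I and J record H as their parent; the only things that changed besides the new commits are the hash IDs stored under the names dev and master.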

Filter-branch

The filter-branch command iterates over some set of commits and copies them. This is often all commits, depending on how you use it: you ran it over HEAD, which means the current branch, but perhaps you have only one branch name, master. It starts by listing, in the appropriate order, every commit hash ID that is to have the copying process applied. If all you have is a linear chain (like A-B-...-H), this is those IDs in that order. Let's assume this for simplicity.

Then, for each such commit, filter-branch:

1. extracts the commit into a temporary area (or pretends to, for speed);
2. applies your filter(s);
3. uses git commit or equivalent (depending on filters, again) to make a new commit that preserves every unchanged bit, but keeps whatever changes are made.

If the new commit is 100% identical, bit-for-bit, to the original, the new hash ID is the original hash ID. Let's say that happens for A itself: there are no changes to make, so Git re-uses the ID. The repo contents now look like this:

A--B--C--D--E--F--G--H   <-- [original master]
 .
  ...<-- [new master, being built]

Then Git moves on to the next commit hash ID in the list, which is B . Let's say that the filter makes some change this time (removing a big file), so that the new commit has a new, different hash ID, which we'll call B' :

A--B--C--D--E--F--G--H   <-- [original master]
 \
  B'  <-- [new master, being built]

Filter-branch moves on to C . Even if it has no change to make to C 's snapshot, filter-branch is forced to make one change now: it must make a new C' whose parent is B' , because something happened to B . So now we get C' :

A--B--C--D--E--F--G--H   <-- [original master]
 \
  B'-C'  <-- [new master, being built]

This repeats for all the remaining commits. All of them get new hash IDs, maybe in part because something in the snapshot changed, but certainly because their parent hash also changed. At the end, git filter-branch rewrites the name master itself to point to the final copied commit, H' :

A--B--C--D--E--F--G--H   <-- [original master, now in refs/original/]
 \
  B'-C'-D'-E'-F'-G'-H'  <-- master

All of this happens purely in your local repository—no other Git, no clone of the original repository, knows that any of this has occurred.

(Note that if you do multiple filter-branch operations, each one copies the chain of commits. Some of the intermediate results may be of no real value. Git will eventually garbage collect the unused and unreachable commits, typically after about a month. Since filter-branch copies things, you will see space usage increase a bit, rather than decrease, until the eventual garbage collection and subsequent rebuilding of pack files.)
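To see the copying in action without touching a real repository, here is a minimal sketch in a scratch repository. The file names (a.txt, big.bin, c.txt), the `demo` identity, and the three-commit history are assumptions for the example; FILTER_BRANCH_SQUELCH_WARNING only silences the deprecation notice that newer Git versions print.

```shell
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name demo
git config user.email demo@example.com

# Three commits: A (no big file), B (adds big.bin), C (unrelated change).
echo a > a.txt;     git add a.txt;   git commit -q -m A
echo big > big.bin; git add big.bin; git commit -q -m B
echo c > c.txt;     git add c.txt;   git commit -q -m C

old_root=$(git rev-list --max-parents=0 master)
old_tip=$(git rev-parse master)

git filter-branch --index-filter 'git rm -q --cached --ignore-unmatch big.bin' HEAD >/dev/null 2>&1

new_root=$(git rev-list --max-parents=0 master)
new_tip=$(git rev-parse master)
backup=$(git rev-parse refs/original/refs/heads/master)
remaining=$(git ls-tree -r --name-only master big.bin)

echo "root: $old_root -> $new_root"
echo "tip:  $old_tip -> $new_tip"
echo "refs/original/refs/heads/master: $backup"
```

Commit A needed no change, so its ID is reused; B and C are copied to B' and C', and the old tip stays reachable through refs/original/refs/heads/master.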

Where things went wrong

Things probably did not go wrong where you think; I think the problem most likely occurred here:

After that I clicked the Sync button in the GitHub Desktop client

I have never used the GitHub Desktop software, so I can't be certain of what it does when. But this is most likely when:

[something] created a new commit named Merge remote-tracking branch 'origin/master'

because git filter-branch does not do that (well, not unless you write a very complicated filter). What does do that is git merge: you connect to another Git, which still has the original A-B-...-H sequence, your Git sets your origin/master to remember their H, and your Git runs a merge that connects their H to your H':

A--B--C--D--E--F--G--H   <-- origin/master
 \                    \
  B'-C'-D'-E'-F'-G'-H'-I  <-- master

where I is a merge commit that has two parents.

How to fix it

What you'll need to do, now that the only copies of the repository you have are the "dual commits" version, is:

1. Start with that dual version.
2. Use git branch -f or git reset --hard to move your branch name(s) to point to some commit before the merge that joins up the two separate histories.

Assuming you have only one master and that you have that checked out now, git reset is the way to go. (You can only use git branch -f on branches that do not have HEAD attached. You can only use git reset on branches that do have HEAD attached.) Find the commit you want to retain, i.e., the filtered one, which will be the first parent of the merge commit, and tell Git to make the name master point to that commit, abandoning the merge. Note that this will lose any unsaved work; and this also assumes you have not made any commits atop the merge:

$ git reset --hard HEAD~1   # or HEAD^

Now the picture looks more like this:

A--B--C--D--E--F--G--H   <-- origin/master
 \
  B'-C'-D'-E'-F'-G'-H'  <-- master

which is basically the same as what you had after the series of git filter-branch commands: the only real difference is that we're showing the name origin/master as the way your Git finds commit H . (The Git over on origin is using its name master to find commit H in its repository. Your Git is remembering their master as your origin/master .)
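The merge-then-reset sequence can be rehearsed in a scratch repository before trying it on the real one. In this sketch the two histories are built by hand rather than with filter-branch, and the branch name `unfiltered`, the file names, and the `demo` identity are assumptions for the example; refs/remotes/origin/master is set directly to stand in for what a fetch would record.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name demo
git config user.email demo@example.com

# A: the shared root commit.
echo a > a.txt; git add a.txt; git commit -q -m A

# Unfiltered history A--H (H still contains big.bin); pretend it is origin/master.
git checkout -q -b unfiltered
echo big > big.bin; echo c > c.txt; git add big.bin c.txt; git commit -q -m H
git update-ref refs/remotes/origin/master unfiltered

# Filtered history A--H' on master: the same change without big.bin.
git checkout -q master
echo c > c.txt; git add c.txt; git commit -q -m "H'"
filtered_tip=$(git rev-parse master)

# The accidental merge (commit I in the diagram): big.bin comes back.
git merge -q --no-edit origin/master
first_parent=$(git rev-parse HEAD^1)
second_parent=$(git rev-parse HEAD^2)
file_after_merge=$(git ls-tree --name-only HEAD big.bin)

# The repair: move master back to the merge's first parent.
git reset -q --hard HEAD~1
master_after_reset=$(git rev-parse master)
file_after_reset=$(git ls-tree --name-only HEAD big.bin)

git log --oneline --graph master origin/master
```

On a real repository it is worth checking git rev-parse HEAD^1 and HEAD^2 first, to confirm that HEAD really is the merge and that its first parent is the filtered commit, before running the reset.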

If everything now looks good, your remaining job is to convince their Git—the one over at origin —to take your new chain of commits and to move their name master so that it points to commit H' , the final corrected copy you made of your original H . To do that, you will use git push . However...

If you just run git push origin master to send them your copies and request that they change their master to point to commit H' instead of commit H, they will say no. Making that change would cause their Git to "forget" or "abandon" commit H, which would lose commit G, which would lose commit F, and so on, all the way back to whichever commit(s), if any, you retained. But you can change your polite request, "Please, if it's OK, set your master", into a forceful command: "Set your master!" You do this with git push --force.

It's still up to them (GitHub) to decide whether to obey, but if you control the repository over on GitHub, you can obviously set things up so that this is OK. Be aware, however, that anyone else who has a clone of the original repository still has the original A-B-...-H chain of commits. They can merge that chain and politely request that GitHub, or you, take the commits they have that you don't (their merge, plus everything leading up to commit H itself) and merge it back into your master. So even though you deliberately threw away those commits, they can very easily come back to haunt you.

(It's very hard to get rid of something forever, in Git. This is generally considered a feature.)
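The difference between the polite request and the forceful command can be tried against a local bare repository standing in for GitHub. This is a sketch under assumptions: the directory names, the `demo` identity, and the use of git commit --amend as a stand-in for the rewrite are all made up for the example.

```shell
set -e
work=$(mktemp -d)
cd "$work"
git init -q --bare origin.git             # stands in for the repository on GitHub
git init -q clone
cd clone
git symbolic-ref HEAD refs/heads/master
git config user.name demo
git config user.email demo@example.com
git remote add origin "$work/origin.git"

echo a > a.txt; git add a.txt; git commit -q -m A
echo h > h.txt; git add h.txt; git commit -q -m H
git push -q origin master                 # their master now points to H

# Rewrite the tip locally: H becomes H', a different commit.
git commit -q --amend -m "H'"

# Polite request: refused, because it would abandon H on their side.
if git push -q origin master 2>/dev/null; then
    plain=accepted
else
    plain=rejected
fi

# Forceful command: their master is moved to H'.
git push -q --force origin master
remote_tip=$(git --git-dir="$work/origin.git" rev-parse master)
local_tip=$(git rev-parse master)

echo "plain push: $plain"
```

A more cautious variant is git push --force-with-lease, which only overwrites their master if it still points to the commit your origin/master remembers.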

Answer 2

Based on @torek's answer, here are the steps I will be taking to fix this issue. I will execute them later today and update this answer with the results (or edits, if any) for reference.

# make sure the current branch is the one with the duplicates, in this case it's `master`
git checkout master

# double check you are on `master`
git status

# create a new branch from `master`
git checkout -b fix-duplicates

# double check you are on `fix-duplicates`
git status

# .. -A-B- .. -C-D-E- .. -F
#      \        /
#       B- .. -C

# A = aaaaaaaa, branching starts
# B = bbbbbbbb, branching takes effect (one commit after where it started in A)
# C = cccccccc, branching ends (exclude D, the merge commit that caused the duplicates)
# E = eeeeeeee, one commit after the merge commit
# F = ffffffff, most recent commit

# move back to the point where the branching started
git reset --hard A

# 1) to cherry pick with new commit dates
# cherry pick all commits from where the branching takes effect (B) up to where the branching ends (C)
# note: B^..C includes B itself, whereas a plain B..C would skip it
# exclude the merge commit at the top (the one that caused the duplication)
git cherry-pick B^..C

# cherry pick all commits after the merge up to the most recent commit (E^..F includes E)
git cherry-pick E^..F

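# 2) if you want to keep the original dates, run the following loops instead
# --reverse lists commits oldest first, so they are replayed in their original order
for commit in $(git rev-list --reverse B^..C)
do
    # %at is the author timestamp; cherry-pick keeps the author date, and this makes the committer date match it
    export GIT_COMMITTER_DATE=$(git log -1 --format='%at' "$commit")
    git cherry-pick "$commit"
done

for commit in $(git rev-list --reverse E^..F)
do
    export GIT_COMMITTER_DATE=$(git log -1 --format='%at' "$commit")
    git cherry-pick "$commit"
done

# stop overriding the committer date for later commands
unset GIT_COMMITTER_DATE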

# make sure the fix is good by comparing the two branches, they should be identical
git diff master..fix-duplicates

# make the fixed branch the new `master`
git checkout master
git reset --hard fix-duplicates

# review what you did (optional)
git reflog

# forcefully push the changes (make sure everything is right before this step!)
git push -f origin master
