git equivalent of “hg strip”

Question

This question -- or the substantially similar "how to undo git fetch" -- has been asked a handful of times before, but the answers have always been too simplistic. In particular, the most popular answer (git reset --hard) doesn't touch the repository at all.

The answer I'm looking for should leave the local git repo (not the working directory, not the index; I don't really care what state those end up in) at the same state (bit-for-bit, ideally) it was in prior to the fetch. I would accept a re-clone of upstream or a modification of the local repo. I'd be okay with a solution that ignores multiple branches, multiple remotes, etc, but a more full-fledged solution would be nice.

The tests I've been using are:

git show shows $want
git log starts at $want
git show $latest errors
git tag -l $latesttagname shows nothing
git show $latesttag errors
git fetch -v clearly downloads all the stuff I stripped away (it should not say "up to date")

where $latest is the commit hash of the head after the fetch I'm trying to undo, $latesttag is the commit hash of the newest tagged commit in the set of commits that needs to get stripped, $latesttagname is the name of that tag, and $want is the commit hash of the old head that I want to be the new head once its descendants are stripped.
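For reference, the offline tests can be scripted roughly as below. This is only a sketch: check_strip is a hypothetical helper name, it covers tests 1 to 4 (test 5 works the same way as test 3), and it leaves out the git fetch -v test because that needs the real upstream.

```shell
# check_strip WANT LATEST LATESTTAGNAME
# Hypothetical helper; run it inside the repository being checked.
check_strip() {
    want=$1 latest=$2 latesttagname=$3
    # Tests 1 and 2: HEAD resolves to $want.
    [ "$(git rev-parse HEAD)" = "$(git rev-parse "$want")" ] ||
        { echo "HEAD is not $want"; return 1; }
    # Test 3: the stripped commit is no longer in the object store.
    if git cat-file -e "$latest" 2>/dev/null; then
        echo "$latest still exists"; return 1
    fi
    # Test 4: the tag name is gone.
    [ -z "$(git tag -l "$latesttagname")" ] ||
        { echo "tag $latesttagname still exists"; return 1; }
    echo "ok"
}
```

Called as check_strip $want $latest $latesttagname, it prints ok only when all of the covered tests pass.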

The things I tried:

git reset --hard $want : passes only the first two tests
Adding git prune -nv doesn't help
Neither does git gc --prune=now
If I clone the repo after doing the reset, then in the child the third test passes and the sixth appears to as well (after setting the remote url to the appropriate upstream)
What makes all the tests pass is to run git tag -d $(git tag --contains $want | tail -n +2) in my local parent before cloning it; in particular, the commits previously referenced by tags are no longer referenced, so I'm guessing they get left behind. But it suggests that there may be other reference types I'm neglecting.

I brought out the big hammers and did something that kinda seemed to approximate what hg strip does (apologies for the zshisms in there):

    tags=( $(git tag --contains $want | tail -n +2) )
    mv .git/objects/pack .
    git unpack-objects < pack/pack*.pack
    rm -rf pack
    git log --pretty=format:%H $want..HEAD | while read i; do
      rm .git/objects/$i[1,2]/$i[3,-1]
    done
    git update-ref refs/remotes/origin/master $want
    git update-ref refs/heads/master $want
    for i in refs/tags/$^tags; do git update-ref -d $i; done
    git repack -ad

which did pretty much what you'd expect a hammer to do -- break a bunch of things: git fsck --unreachable complains about a bunch of invalid reflog entries and unreachable blobs (and one dangling commit), but cloning makes these go away.

It seems to me that the reset + tag removal + clone is the best of my options here, but a) what other refs do I need to remove/repoint, and b) is there a way to avoid the need for a clone? Or is there another way entirely?

Apologies for the length of this. I wanted to be as clear as possible about what I was looking for (and what I wasn't) and what I'd already found to have failed (and why). If you got this far, thank you.

Answer 1

TL;DR answer: your method (reset, remove any tags that point into the reset-away area, then clone) works and is probably the way to go.

If you want to clean out an existing repo, you'll (at least potentially) need git for-each-ref and reflog cleaning (of each ref with a reflog, and then also of HEAD ).

It's possible (but noticeably more difficult) to do this without cloning, and you're on the right track with git gc --prune=now (or --prune=all, as the documentation says). It may be better, though, to use an explicit git repack and git prune, and you probably also need to clear away some reflog entries.

The "big hammer" method may be too aggressive as it can ² remove objects that are needed as they are the same as objects in commits you're retaining. (Edit after realizing you're just removing commit objects: and, it will probably leave objects behind, although they will not be seen in normal usage. Whether this is a big problem depends on how important it is to remove those objects. See the general theory section below.)

Cloning is easier because a clone brings over only those objects reachable from the references it brings over. That is, you control what gets added, rather than what must be avoided. Your method (remove any leftover tags and make sure no branch contains any of the commits you want gone) will suffice as long as you use a smart protocol and/or expand any packs (see the pack section below).
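A rough sketch of that recipe follows. The helper name strip_by_clone and its arguments are made up for illustration, it assumes a single branch is checked out and is the only one that contains the unwanted commits, and it assumes Git 2.7 or newer for git tag --no-merged. The --no-local flag is there because a plain clone from a filesystem path may copy or hard-link whole pack files instead of going through the smart protocol.

```shell
# strip_by_clone REPO WANT DEST
# Hypothetical helper: rewinds the current branch of REPO to WANT, deletes
# tags that point past WANT, then clones the result into DEST.
strip_by_clone() {
    repo=$1 want=$2 dest=$3
    # Rewind the checked-out branch (this discards local changes).
    git -C "$repo" reset -q --hard "$want" || return 1
    # Delete every tag whose commit is not reachable from $want.
    # Assumption: Git 2.7 or newer, for `git tag --no-merged`.
    for t in $(git -C "$repo" tag --no-merged "$want"); do
        git -C "$repo" tag -d "$t" >/dev/null || return 1
    done
    # --no-local forces the normal transport even for a filesystem path,
    # so only reachable objects are sent rather than whole pack files.
    git clone -q --no-local "$repo" "$dest"
}
```

The clone has its origin set to the local parent, so the remote URL still has to be pointed at the real upstream afterwards, as you already did.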

What gets copied

Somewhat formally, suppose the entire repository graph is G (this is all the objects—commits, tags, trees, and blobs—stored in the repository). A clone finds a subset g of G by starting with the branch names being cloned, ¹ finding the objects those name, and then recursively adding any objects reachable from those objects. Hence if refs/heads/master points to some commit, you get that commit, its tree, any subtrees in its tree, and all blobs associated with those trees; plus the commits that are its parents, their trees, and subtrees and blobs, and so on. Repeat for every name in refs/heads/ and you have the basic sub-graph g .
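To see roughly which objects make up g for a given repository, git rev-list can do the same walk. This is a sketch, and reachable_from_branches is a hypothetical helper name:

```shell
# reachable_from_branches REPO
# Hypothetical helper: prints the ID of every object reachable from
# the refs under refs/heads/ in REPO.
reachable_from_branches() {
    # --branches starts the walk at every branch tip; --objects adds the
    # trees and blobs to the commits. Tree and blob lines carry a path
    # after the ID, which cut strips off.
    git -C "$1" rev-list --objects --branches | cut -d' ' -f1
}
```

Any object in the repository that does not show up in this list is outside g, even though git show on its hash may still succeed locally.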

(A fetch uses the same process except that the receiver tells the sender which SHA-1 IDs he has as existing branch tips. The sender can use this to exclude objects reachable from those SHA-1s, provided the sender has those SHA-1s. This is only true of the "smart" protocols, but in general you should use smart protocols.)

Both clone and fetch take a hybrid approach to tags by default. If you add --tags, they simply use all the tags presented by the remote in addition to the branch names (incidentally, you can use git ls-remote to see all references a remote provides). If you use --no-tags, they ignore the tags entirely. Otherwise they ignore the tags initially, build the sub-graph g as usual, then add as references any tags that point to objects already in g. For lightweight tags, this adds no objects, but for each annotated tag, this adds the annotated tag object to g (and if the tag points to something other than a commit, e.g. to another tag, you can get more objects here as well).

Once Git has constructed the clone or fetch graph g , it extracts the objects, makes a new pack out of them, and sends that pack. This means you don't need to worry about an existing pack containing a "stale" object (which might, for instance, contain sensitive data).

Without cloning

If you don't want to clone, though, you must consider all the references Git can see. Both git gc and git prune (which git gc runs for you) will delete unreferenced loose objects. Obviously this means that "being referenced" is one key here. But there's another saving throw, if you will, for loose objects: Git won't delete them if they're very young. (This protects objects being generated before their references get inserted into the repo.) Using --prune=all (or --expire now ) will kill off these objects. Besides these, there's a separate issue with "packed" objects—but let's leave that alone for now.

The other standard references to objects, besides branch and tag names, are:

remote-tracking branches (anything in refs/remotes/ )
notes (anything in refs/notes )
the stash ( refs/stash )
Git's various special-case references: HEAD, FETCH_HEAD, ORIG_HEAD, MERGE_HEAD, and CHERRY_PICK_HEAD (these are the only normal ones, but any file in .git can act as a reference: see gitrevisions)

But to be safe you should probably look at .git/refs and .git/packed-refs , or better, and simpler, use git for-each-ref . For each reference discovered, see if it points to a commit you intend to discard; if so, delete (or re-point) that ref. This also covers the case for remote-tracking branch origin/foo (which is just refs/remotes/origin/foo ) when you're deleting or rewinding foo on origin and don't want those commits showing up.
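A sketch of that check, assuming a single line of history where everything past $want is to be discarded. refs_beyond is a hypothetical helper name, and it only covers what git for-each-ref lists (so not HEAD, FETCH_HEAD and friends). Refs with unrelated history, such as those under refs/notes/, will show up too, so treat the output as a list of candidates rather than a delete list.

```shell
# refs_beyond WANT
# Hypothetical helper: prints each ref whose commit is neither WANT nor an
# ancestor of WANT, i.e. a ref that would keep stripped commits alive.
refs_beyond() {
    want=$1
    git for-each-ref --format='%(refname)' | while read -r ref; do
        # Peel tags to the commit they name; skip refs that name no commit.
        commit=$(git rev-parse --verify -q "$ref^{commit}") || continue
        git merge-base --is-ancestor "$commit" "$want" || echo "$ref"
    done
}
```

Each ref it prints can then be deleted with git update-ref -d or re-pointed with git update-ref, the same commands used in the question.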

Next, you'll need to look at the reflogs. If you delete a ref, Git (currently) deletes its reflog as well, so this is only for those references that were not deleted in the first pass. A reflog is basically a flat history of each ref. For instance, refs/remotes/origin/foo may have named commit 1234567... before the fetch you're un-doing / stripping, and now names commit fedcb9a... . Its reflog will contain 1234567... . If you want that commit stripped, you need to remove this reflog entry. You can do this with git reflog delete ref@{number} ; see the git reflog documentation . Normally there is a reflog for each branch and each remote-tracking branch, plus one for HEAD .
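In command form this might look like the sketch below. The ref name and entry number in the comments are only illustrative. drop_all_reflogs is a hypothetical helper showing a much blunter alternative to per-entry deletion: it empties every reflog, which is acceptable only if you don't need the reflogs for anything else.

```shell
# Inspect the entries recorded for one ref (illustrative ref name):
#   git reflog show refs/remotes/origin/foo
# Delete a single entry, as described above (illustrative entry number):
#   git reflog delete 'refs/remotes/origin/foo@{1}'

# drop_all_reflogs: blunt alternative that expires every entry in every
# reflog, HEAD's included. All reflog-based undo history is lost.
drop_all_reflogs() {
    git reflog expire --expire=now --all
}
```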

Packs

It looks like you already took care of this with git unpack-objects , but I'll go over this for completeness.

Even once you have all the references cleaned out, there's still that packed-objects issue: Git may keep unreferenced objects inside packs. To get rid of them you must use git repack -a -d to force Git to make a new single pack file and delete any existing pack files. I'm not sure whether this will remove pack files that have an associated .keep file ( git gc won't and git gc mostly just runs other commands anyway, so it's possible that git repack won't either).
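Put together, the final step might look like this sketch. purge_unreachable is a hypothetical helper name, and it assumes every ref and reflog entry that named the unwanted commits has already been removed. It does nothing special about .keep files.

```shell
# purge_unreachable: hypothetical helper; run it inside the repository, and
# only after all refs and reflog entries naming the unwanted commits are gone.
purge_unreachable() {
    # Write all reachable objects into one new pack and delete the old packs.
    git repack -a -d -q || return 1
    # Delete loose objects that nothing references, however recent they are.
    git prune --expire=now
}
```

Afterwards, git cat-file -e $latest should fail, which corresponds to the third test in the question.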

¹ And sometimes tags (if you use --tags ). Normally the branch names selection is "all branch names"—everything in refs/heads/ —but you can instruct git clone to clone only one branch, with --single-branch .

² The likelihood of this depends on the age of the objects: the oldest objects that get reused wind up in pack files, where they would have been safe if you hadn't unpacked them. :-) Edit: also, I misread the example at first: you're only stripping out commit objects, so as long as they're not shared across other branches you should be OK.

solution1
0 ACCEPTED 2015-11-19 04:36:15

What gets copied

Without cloning

Packs

git equivalent of “hg strip”

Question

1 answers

solution1 0 ACCPTED 2015-11-19 04:36:15

What gets copied

Without cloning

Packs

solution1
0 ACCPTED 2015-11-19 04:36:15