How to retain pre-rename history while moving a few files from one Git repository to another?

Summary of question

I need to move a few files from one repository to another, while keeping their history of changes. I already moved them in a source repository into a dedicated folder with git mv (per Greg Bauer's widely quoted post), which results in all pre-folder-move history not being copied into the target repository when following Greg's script.

I have only master branch in each of the repositories involved.

In the case of the first source repository, the original files used to reside in the root folder before I moved them to the dedicated folder.

In the case of the second source repository, (other) original files used to reside in a first-level folder which also stores many other files (which I don't need to move).

The target repository already has some other files and folders which I need to preserve, with its history of commits.

Finally, if everything is copied correctly to the destination repository, I need a clean way to delete (hide?) the original files from the source repositories.

Update 2019-03-25 12:00 UTC: Some more details on my situation, following torek's brilliant explanation :

  1. I was and am the only user of all three repositories in question (both source repositories and one target repository); each of the source repositories is used on a single workstation
  2. One source repository is hosted at GitHub; another at GitLab (as a private project). The target repository is hosted at GitLab, as a private project. There is no "multiple repositories storing the same commit" at this point, if I understand correctly what 'repository' specifically means.
  3. My local '.git' folders for these repositories are quite small; the biggest is only 12 MB on disk, 2.5K files. So performance doesn't seem to be a huge issue.
  4. What I am most interested in seeing in the target repository is: (a) essential, diff "before vs after" of the files in question; (b) important enough, timestamp of the original change; (c) nice to have, name of the original committer (always myself)
  5. In future situations, I will need to migrate from a private repository (containing other private files) to a public repository which should not have any mention of those private files or their contents. This is not the case in my specific situation today, however.

Something I considered--but failed to use "off the shelf":

I am not familiar with how a Git repository is structured, so 'git ls-files ... | grep ... INDEX_FILE ... git update-index ...' from Stage 1, step 5 sounds like magic to me.

From answer to another question , it's unclear whether it will help with individual files already moved into a dedicated folder (and/or whether it's safe to rollback the move prior to migration).

Also, how can I decide between not/using these steps :

git reflog expire --expire=now --all
git reset --hard
git gc --aggressive
git prune

I'm also struggling to compile a single script from a set of snippets in this post which also seems to be somewhat relevant.

No answer is going to be completely satisfactory to everyone in every case. That's because you literally cannot copy file history from one Git repository to another, for the simple reason that Git does not have file history. You cannot remove a file from (existing) history for a different, but related, reason. But what you can get may be good enough.

Git history is commits, and commits are immutable

As I've said many times before, Git's raison d'être is the commit. What Git does is store commits, plus a little extra to make those more useful. The extra part means that sometimes , you can do something that's sufficient for what you want—though this of course depends on precisely what you want—or, perhaps, what you'll settle for. Let's take a close look at commits and see just how they are history.

Each commit is a mostly-stand-alone entity. A commit saves a complete snapshot of all files—all files as of that commit, that is—along with some metadata . Each unique commit is uniquely identified by its hash ID. Here's an actual commit from the Git repository for Git itself (with @ changed to space to maybe, perhaps, cut down a bit on spam):

$ git cat-file -p b5101f929789889c2e536d915698f58d5c5c6b7a | sed 's/@/ /'
tree 3f109f9d1abd310a06dc7409176a4380f16aa5f2
parent a562a119833b7202d5c9b9069d1abb40c1f9b59a
author Junio C Hamano <gitster pobox.com> 1548795295 -0800
committer Junio C Hamano <gitster pobox.com> 1548795295 -0800

Fourth batch after 2.20

Signed-off-by: Junio C Hamano <gitster pobox.com>

That's not how GitHub displays it, of course, but that's the internal Git object that stores the commit, in its entirety. The saved snapshot is obtained via the tree line. The parent line lists the commit that comes before this commit; that parent is itself a merge commit, so it has two parent lines of its own.

The important things here are:

  • The commit is identified by its hash ID, eg, b5101f929789889c2e536d915698f58d5c5c6b7a . That's how any and every Git in the universe knows whether it has this commit: either you do have that hash ID, so you have this commit, or you don't, so you don't.

  • The commit lists a tree , which is the saved snapshot.

  • The commit lists the hash ID(s) of its parent or parents.

What this means is that Git needs only the hash ID of the last commit. Suppose we represent this big ugly hash ID with a single letter such as H (for hash ). We say that commit H stores the hash ID of its parent, which we'll represent as G instead of another big ugly string. Then commit H points to commit G :

          G <-H

But G is a commit. That means it stores the hash ID of its parent, which we can call F :

... <-F <-G <-H

and of course F stores the hash ID of E , and so on, in a backwards-looking chain. The chain can fork and re-combine, and if we were going forwards instead of backwards, forking would occur when we make branches, and re-combining would occur when we merge the branches. But since Git actually works backwards, the forking happens at the merges; the re-combining happens when we run out of merged stuff:

             I--J
            /    \
...--F--G--H      M--N--...--T   <-- master
            \    /
             K--L

In any case, this chain is the Git history. The item that provides the hash ID of the last commit in the chain is, as suggested in the drawing above, a branch name such as master .

This is all that Git has. There is no file history, there are only commits. We find the commits by starting at a tip commit, like T , whose hash ID we find by name, like master . We add new history—new commits—to the repository by making a new commit U whose parent is T , and then changing the name master to point to the new commit U .
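
The backwards walk described above can be sketched in a few lines of shell. This is only an illustration under stated assumptions: the throwaway repository, the file name file.txt and the commit subjects are all made up for the demo, and the loop uses git rev-parse to hop from each commit to its first parent.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q repo
cd repo
git symbolic-ref HEAD refs/heads/master   # make sure the branch is called master

# Three commits F, G, H in a straight line (hypothetical content).
for name in F G H; do
    echo "$name" > file.txt
    git add file.txt
    git commit -q -m "commit $name"
done

# Start at the tip that the name master points to, then hop parent to parent.
commit=$(git rev-parse master)
chain=""
while [ -n "$commit" ]; do
    chain="$chain$(git log -1 --format=%s "$commit");"
    # The root commit has no parent: rev-parse prints nothing and we stop.
    commit=$(git rev-parse -q --verify "$commit^" || true)
done
echo "$chain"
```

The only thing the loop needs to get started is the name master; everything else comes from the parent links stored inside the commits themselves.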

Commits are immutable because their real names—their hash IDs—are computed by running a cryptographic checksum over all of the contents of the commit. If we were to take the commit above and change anything about it—such as the stored date-stamps on the author or committer line, or the log message, or the snapshotted tree —we'd have to compute a new checksum over the new data. That checksum would be different, and rather than changing the existing commit H , we would just have a new commit H' :

...--F--G--H--I--J   <-- master
         \
          H'  <-- need-a-name-here

This new commit H' has G as its parent, so H' is just a branch. We must now invent a branch name so as to store the hash ID of new commit H' that is a copy of H but with something changed. We did not change any commit, we just added a new commit.
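
Here is a small sketch of that idea, assuming a throwaway repository with made-up content. It builds H' by hand with git commit-tree, reusing H's snapshot and H's parent but changing the log message, and then stores the new hash ID under the branch name from the drawing.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q repo
cd repo
git symbolic-ref HEAD refs/heads/master

echo one > file.txt
git add file.txt
git commit -q -m "G"
echo two > file.txt
git commit -q -am "H"

h=$(git rev-parse master)            # the existing commit H
g=$(git rev-parse master^)           # its parent G
tree=$(git rev-parse "master^{tree}") # H's saved snapshot

# Same snapshot, same parent, different message: a different commit H'.
h_prime=$(git commit-tree "$tree" -p "$g" -m "H, with a changed message")
git branch need-a-name-here "$h_prime"

echo "H  = $h"
echo "H' = $h_prime"
```

After this runs, master still names the untouched H, while need-a-name-here names H'. Both commits share G as their parent.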

But I can run git log --follow somefile.ext , isn't that file history?

Maybe it is! But it's not stored in Git . What's stored in Git is the commits. What git log did was to start at some branch name, like master , and find the commit there—the tip commit of the branch. That commit has a hash ID and a log message and a snapshot. Of course, Git was able to find that commit's parent commit, as saved in the tip commit.

Now comes the tricky part. This all happens in a big loop, working on each commit, one commit at a time. Git chooses whether or not to show the commit it's working on , and for git log somefile.ext :

  • Git extracts the parent commit snapshot to a temporary area.

  • Git extracts the commit's snapshot to a temporary area.

    (It doesn't really extract the commits, but if you think of it this way, it may make more sense. In reality it just compares in-tree hash IDs, which suffices. Later, if you've asked git log to show diffs, it really does do some partial extraction. But this is all just an optimization, really.)

  • Now git log compares the two snapshots. Did somefile.ext change? If so, show this commit.

  • Having shown, or not shown, this commit, move to the commit's parent.

Without --follow , that's all that git log somefile.ext does. You see a synthetic "file history", consisting of the subset of the commit history in which the file has changed from parent to child. That's it! What you saw was selected commit history . You can call that "file history" if you like, but it's computed dynamically, from the commit history that Git actually keeps.

Adding --follow tells git log to do one more thing: while comparing the two commits, check to see if the comparison suggests that in the parent commit, somefile.ext had a different path name . If the parent commit called the file oldname.dat , for instance, git log --follow switches names when it moves back one step in the commit history.
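
To see the difference, here is a minimal sketch in a scratch repository. The file names oldname.dat and somefile.ext come from the text above; the contents and commit messages are invented for the demo.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q repo
cd repo
git symbolic-ref HEAD refs/heads/master

echo v1 > oldname.dat
git add oldname.dat
git commit -q -m "create oldname.dat"

git mv oldname.dat somefile.ext
git commit -q -m "rename to somefile.ext"

echo v2 >> somefile.ext
git commit -q -am "edit somefile.ext"

# Without --follow: only commits where the path somefile.ext itself changed.
plain=$(git log --format=%s -- somefile.ext | wc -l | tr -d ' ')
# With --follow: git log switches to oldname.dat when it crosses the rename.
followed=$(git log --follow --format=%s -- somefile.ext | wc -l | tr -d ' ')

echo "without --follow: $plain commits, with --follow: $followed commits"
```

The plain run stops at the rename commit, because that is where the path somefile.ext first appears; the --follow run continues under the old name.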

There are some hitches here, especially around merge commits. A merge commit is a commit with two parents instead of just one. Git literally cannot show both paths simultaneously—it's marching back through commit history, one commit at a time. So, when it hits these merges—which is where history diverges because Git works backwards—it typically picks just one leg of the history to follow.

(The details here get quite complicated. See the History Simplification section of the git log documentation, but it's heavy going. When run without specific file names, so as to show all the commits, git log by default goes down both legs of the merge, in a manner that's a little hard to describe properly: we have to introduce the notion of a priority queue here. Linear history, with no merges, avoids all of this messiness, and is easier to think about.)

Now back to the problem at hand

Let's go back to the original, concise, summary of the desired outcome:

I need to move a few files from one repository to another, while keeping their history of changes.

That is, we want files taken from commits from RepoA to somehow appear in commits that are in RepoB.

We can immediately see the problem: the history of those files is really all the commits in RepoA , or at best, some subset of commits from RepoA . Each of those commits is a full snapshot of all of its files.

Moreover, if we take those snapshots—either as a whole, or in some reduced form—and put them in RepoB, those snapshots aren't the same as any of the existing snapshots in RepoB. Let's take a simple concrete example where RepoA has four snapshots ABCD in a nice linear chain, and RepoB has another four EFGH likewise:

RepoA:

A--B--C--D   <-- master

RepoB:

E--F--G--H   <-- master

If we just copy all the commits from RepoA to RepoB unmodified, we get this in RepoB:

E--F--G--H   <-- master

A--B--C--D   <-- invent-a-name-here

That's clearly not what we want. We can do something and that's what all the answers you have been looking at are about.
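
A quick sketch of that unmodified copy, assuming two scratch repositories named repo-a and repo-b with one made-up file per commit. The helper function make_repo is hypothetical and exists only to build the demo.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# make_repo DIR "NAMES": create DIR with one commit per name (demo helper).
make_repo() {
    git init -q "$1"
    (
        cd "$1"
        git symbolic-ref HEAD refs/heads/master
        for n in $2; do
            echo "$n" > "$n.txt"
            git add "$n.txt"
            git commit -q -m "commit $n"
        done
    )
}

make_repo repo-a "A B C D"
make_repo repo-b "E F G H"

cd repo-b
# Copy every commit reachable from RepoA's master, under an invented name.
git fetch -q ../repo-a master:invent-a-name-here

total=$(git rev-list --count --all)
# merge-base exits non-zero when the two chains share no commit at all.
if git merge-base master invent-a-name-here >/dev/null; then
    related=yes
else
    related=no
fi
echo "commits in RepoB: $total, histories related: $related"
```

RepoB now holds all eight commits, but the two chains never meet, which is exactly the picture drawn above.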

What we can do here

If we want somefile.ext out of RepoA, and it's first created in commit B and then modified in commit D , what we can do is make two new commits I and J that have only that one file . We can make them anywhere—all Gits are equal—so let's make a RepoC by cloning RepoA, then make them in RepoC, mostly just for illustration:

$ git clone <url-of-RepoA> repo-c
$ cd repo-c
$ git checkout --orphan for-transplanting
$ git rm -rf .                              # empty the index and work-tree
$ git checkout <hash-of-B> -- somefile.ext  # get the first copy of the file
$ git commit -m 'initial commit of somefile.ext'  # and commit it
$ git checkout master -- somefile.ext       # get the 2nd and last copy
$ git commit -m 'update somefile.ext'       # and commit that one

Now RepoC contains:

A--B--C--D   <-- master, origin/master

I--J   <-- for-transplanting

We can now copy commits I and J into RepoB:

$ cd <path-to-repo-B>
$ git fetch <path-to-repo-C> for-transplanting:for-transplanting

which gives us this in RepoB:

E--F--G--H   <-- master

I--J   <-- for-transplanting

where commits I and J have the file we want.

That file is on the J -then- I -then-stop history , which consists of those two commits. (The git checkout --orphan trick made sure that when we made commit I , it had no parent—it was a root commit, just like the very first commit we would make in a new, empty repository. Remember, all commits, with their unique hash IDs, are universal across every Git repository: you either have that commit, with its hash ID, or you don't. RepoB did not have them, and now, after git fetch , RepoB has them.)

These histories are, clearly, unrelated: there is no way to hop from J to the H -and-back chain, nor vice versa. But we can now tell Git to "marry" commits H and J , making a new commit K :

$ git checkout master
$ git merge --allow-unrelated-histories for-transplanting

This uses a (non-existent, faked via the empty tree ) truly empty commit as the merge base, so that all files in H are newly created and all files in J (just the one file) are (is) newly-created. It combines these changes— add all files to nothing and add somefile.ext to nothing —which is easy to do, applies those changes to the empty tree that has no files, and commits the result as new commit K :

E--F--G--H--K   <-- master
           /
I---------J   <-- for-transplanting

The synthetic "file history" of your new file somefile.ext is now found by looking at K , seeing that the file exists in J but not in H , and following that leg backwards. The file exists in I and J and is different, so commit J gets shown. Then Git goes to I . The file doesn't exist in the nonexistent commit before I , so it's clearly different in I and commit I gets shown. Then there's no more commit to go back to, so git log stops.
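
Putting the steps of this section together, here is an end-to-end sketch on scratch repositories. The repository names, the extra files other.txt and readme.txt, and the single-letter commit messages are all assumptions made for the demo; the Git commands themselves are the ones shown above, with -q and -m added so the script runs without prompting.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# RepoA: somefile.ext is created in B and changed in D.
git init -q repo-a
(
    cd repo-a
    git symbolic-ref HEAD refs/heads/master
    echo a > other.txt;    git add other.txt;    git commit -q -m A
    echo v1 > somefile.ext; git add somefile.ext; git commit -q -m B
    echo c >> other.txt;   git commit -q -am C
    echo v2 > somefile.ext; git commit -q -am D
)

# RepoB: the target, with its own history to preserve.
git init -q repo-b
(
    cd repo-b
    git symbolic-ref HEAD refs/heads/master
    echo e > readme.txt; git add readme.txt; git commit -q -m E
)

# RepoC: a disposable clone in which we build commits I and J.
git clone -q repo-a repo-c
cd repo-c
first=$(git rev-list --reverse master -- somefile.ext | head -n 1)  # commit B
git checkout -q --orphan for-transplanting
git rm -q -rf .
git checkout "$first" -- somefile.ext
git commit -q -m 'initial commit of somefile.ext'
git checkout master -- somefile.ext
git commit -q -m 'update somefile.ext'

# Bring I and J into RepoB and marry the two histories.
cd ../repo-b
git fetch -q ../repo-c for-transplanting:for-transplanting
git merge -q --allow-unrelated-histories -m 'merge somefile.ext history' for-transplanting

history=$(git log --format=%s -- somefile.ext | tr '\n' ';')
echo "$history"
```

The final git log shows only the two transplanted commits for somefile.ext, while readme.txt and its own history stay in place on master.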

Note that we could make I and J in RepoA directly. Or, we could copy all of RepoA 's commits ( ABCD ) into RepoB , then make I and J in RepoB, then remove all traces of names that led to commits ABCD . The now-unused/unreferenced commits will eventually go away for real (typically some time after 30 days), and in the meantime you won't see them and they won't bother you; they'll just consume a small bit of disk space. The real advantage to using RepoC is that we can experiment there, and if things have gone wrong, just blow the whole thing away and start over.

Now you have a harder problem

Finally, if everything is copied correctly to the destination repository, I need a clean way to delete (hide?) the original files from the source repositories.

There isn't one. There are only dirty ways. How dirty, or how bad the dirt is, depends on your needs.

Again, the original repository has all of its commits. All of them have all the files. In our example, we made the simplifying supposition that there were four commits:

A--B--C--D   <-- master

with somefile.ext first appearing in B , remaining unchanged in C , and then being stored with different contents in D .

Since the file isn't in A , you can retain commit A . But you must build a replacement B' that's like B —has the same metadata, including parent A , as before—but has a saved snapshot that omits the file:

A--B--C--D   <-- master
 \
  B'  <-- ??? (we'll get to this)

Having made B' from B , you now need to make a new commit C' that's like C except for two things:

  • its parent is B' rather than B , and
  • it omits somefile.ext

Once you have made that copy C' of C , you have:

A--B--C--D   <-- master
 \
  B'-C'  <-- ??? (we'll get to this)

Now you must copy D to D' the same way:

A--B--C--D   <-- master
 \
  B'-C'-D'  <-- ??? (we'll get to this)

and now it's time to get to the "what branch name goes in the question marks" issue.

The obvious thing to do is to peel the branch name master off commit D and make it point to D' instead:

A--B--C--D   [abandoned]
 \
  B'-C'-D'  <-- master

Anyone who comes along now and looks at this repository will start with the name master to get the hash ID of D' . They won't even notice that D' has a totally different hash ID than D . They will look at D' and move back to C' , and from there to B' and then back to A .

Well, almost anyone. What if another Git comes along? What if that other Git already has ABCD ? That Git has them and knows them by their hash IDs. Hash IDs are the universal currency of Git exchanges.

The other Gits that may come along are any clone you made of the original repository. All of the clones of RepoA have the original hash IDs, listed under their own name master . You must now convince all of those clones to switch their master from D to your new replacement D' .
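
One way a clone can make that switch is sketched below, under some loud assumptions: the clone has no local work worth keeping (git reset --hard throws away local changes and any local commits on master), and the rewrite in the source repository is simulated with git commit --amend as a stand-in for a real history rewrite.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q repo-a
(
    cd repo-a
    git symbolic-ref HEAD refs/heads/master
    echo one > file.txt
    git add file.txt
    git commit -q -m "C"
    echo two > file.txt
    git commit -q -am "D"
)

# A clone made before the rewrite: its master names the original D.
git clone -q repo-a clone

# Stand-in for the rewrite: replace D with D' in the source repository.
(cd repo-a && git commit -q --amend -m "D, rewritten")

cd clone
before=$(git rev-parse master)
git fetch -q origin                  # origin/master is force-updated to D'
git reset -q --hard origin/master    # move the clone's master to D' as well
after=$(git rev-parse master)
upstream=$(git -C ../repo-a rev-parse master)

echo "clone master before: $before"
echo "clone master after:  $after"
```

Every clone has to do something like this on its own; nothing in the source repository can force it.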

If you're willing to do that—and they are too—then you have your answer: do this to RepoA and get everyone to switch. That leaves only the necessary mechanism: how will you do this to RepoA, and for that matter, how will you get the right commits into RepoC if you don't do it manually?

git filter-branch

Git has a built in command that can do this: git filter-branch . The filter-branch command works by copying commits. Logically (though not physically except for the slowest filter, --tree-filter ), what filter-branch does is:

  • check out each commit;
  • apply your filters;
  • map the original commit's parent hash according to the map-so-far; and
  • build a new commit from the filtered result, and enter <oldhash, newhash> into the map.

If the new commit is 100%, bit-for-bit identical to the original commit, it winds up being the original commit. The map entry says that commit A remains commit A . The filter for commit B makes a change—it removes the file. So the parent for the next commit is A (because A maps to A ) but the new commit gets a new hash ID, B' , and now the map says A = A but B = B' . Now the filter for C happens, removing the file and making the parent of the new commit be B' , so that the result is a new commit C' and that goes into the map. Last, the filter for D happens, making new commit D' with parent C' .

Now that all the commits are filtered, git filter-branch uses the built-up map to replace the hash ID stored under master . The map says that D becomes D' so filter-branch stores D' 's hash under the name master , and we have what we wanted.
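
As a sketch of the removal case, the script below rebuilds the four-commit example in a scratch repository and filters it. It uses --index-filter with git rm --cached --ignore-unmatch, which is one of the faster filters described in the git filter-branch documentation, not something specific to this answer. The file other.txt and the commit messages are invented, and FILTER_BRANCH_SQUELCH_WARNING only silences the warning that newer Git versions print.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
export FILTER_BRANCH_SQUELCH_WARNING=1

git init -q repo-a
cd repo-a
git symbolic-ref HEAD refs/heads/master
echo a > other.txt;     git add other.txt;    git commit -q -m A
echo v1 > somefile.ext; git add somefile.ext; git commit -q -m B
echo c >> other.txt;    git commit -q -am C
echo v2 > somefile.ext; git commit -q -am D

old_a=$(git rev-parse master~3)
old_d=$(git rev-parse master)

# Copy every commit on master, dropping somefile.ext from each snapshot.
git filter-branch -f \
    --index-filter 'git rm -q --cached --ignore-unmatch somefile.ext' \
    -- master >/dev/null 2>&1

new_a=$(git rev-parse master~3)
new_d=$(git rev-parse master)
leftover=$(git log --format=%s master -- somefile.ext)
backup=$(git rev-parse refs/original/refs/heads/master)

echo "A kept its hash: $old_a"
echo "D became D':     $old_d -> $new_d"
```

Commit A comes through with its hash ID unchanged, because nothing about it changed. Note that filter-branch keeps the old tip under refs/original/, so the abandoned chain is still reachable in this repository until that ref is deleted.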

This same technique can be used in RepoC. Remember that RepoC is a temporary, where we can wreak whatever havoc we like. Instead of removing somefile.ext , what we want to do, in our filter, is remove everything except somefile.ext . We'll also almost certainly want the --prune-empty argument.

What --prune-empty does is simple enough to describe. Let's start with how things work without --prune-empty . During the copying process, each original commit gets copied to a new one. That's true even if the new commit, after applying the filter(s), makes no changes . If we have a commit like C that doesn't touch somefile.ext , it probably touches other files instead. (Git normally won't let you make two commits in a row that have the same content; you have to use git commit --allow-empty to make that happen.) But if we remove all the other files ... well, then we effectively have B and C being the same , so after we copy B to B' to have only somefile.ext , we'll copy C to C' to have only somefile.ext . The two copies will match. By default, filter-branch will make C' anyway, so that C has something to map to.

Adding --prune-empty tells Git: Don't make C' , just map C to B' instead. When we do that, we get exactly what we want: Git doesn't make A' at all, makes B' —which we're calling I instead—from B with I having no parent, doesn't make C' , and makes D' —which we're calling J —from D using B' , er, I , as its parent:

RepoC:

A--B--C--D   [abandoned]

   I-----J   <-- master

What's left to do

All that remains is to figure out how to write the filters for git filter-branch . That's what the existing answers you are reading are about.

The easy filter to use is --tree-filter . When you use this filter, Git runs your shell script fragment in a temporary directory. That temporary directory has all the files from the commit being filtered (but has no .git directory, and is not your work-tree!). Your filter just needs to modify the files in place, or remove some files or add some files. Git will make a new commit from whatever your filter leaves in that temporary directory.
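
Here is a sketch of the RepoC case using --tree-filter together with --prune-empty. It assumes somefile.ext lives at the top level of the repository, that a reasonably recent Git is installed (older versions may leave an empty root commit behind), and that the scratch repository stands in for the disposable clone. The find command is just one way to delete everything except the file we want.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
export FILTER_BRANCH_SQUELCH_WARNING=1

# Stand-in for RepoC: A, B, C, D with somefile.ext touched in B and D.
git init -q repo-c
cd repo-c
git symbolic-ref HEAD refs/heads/master
echo a > other.txt;     git add other.txt;    git commit -q -m A
echo v1 > somefile.ext; git add somefile.ext; git commit -q -m B
echo c >> other.txt;    git commit -q -am C
echo v2 > somefile.ext; git commit -q -am D

# In each commit's temporary directory, delete every file except somefile.ext.
# --prune-empty drops the copies that end up changing nothing.
git filter-branch -f --prune-empty \
    --tree-filter 'find . -type f ! -name somefile.ext -exec rm -f {} +' \
    -- master >/dev/null 2>&1

subjects=$(git log --format=%s master | tr '\n' ';')
files=$(git ls-tree -r --name-only master | tr '\n' ';')
echo "commits left: $subjects"
echo "files left:   $files"
```

What remains on master is the two-commit chain that the text calls I and J, ready to be fetched into the target repository.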

This is also the slowest filter, by far. When using this on a big repository, be prepared to wait hours or days. (It helps to use the -d argument to point git filter-branch to a memory-based "file system" in which to do all its work, but it's still very slow.) So most of the answers concentrate on figuring out how to jigger one of the other, faster filters to do the job.

You can choose to work with those, or to use the really slow --tree-filter . Either way, if you use filter-branch, you now know what you're doing and why.
