In git, while converting a subfolder into a submodule, also keep the history from files that were moved into the target folder

Question

I am following a process similar to the one described at https://gist.github.com/korya/9047870, using the command:

git filter-branch --subdirectory-filter sub/module/path HEAD -- --all

In my history there are files that were in other folders and were later moved into that folder, for example:

Created testfile.txt
Modified testfile.txt
Moved testfile.txt to /sub/module/path/testfile.txt
Modified /sub/module/path/testfile.txt

I would want the history of that file (and of any other file that exists in sub/module/path) to exist in the new resulting repository.

Answer 1

TL;DR

You can't get what you want—at least, not without writing some sort of at least semi-fancy tool yourself.

You might be able to get what you need easily. You'll have to think about what you need, and decide whether to try to write a semi-fancy tool.

Long

Git does not have file history. Git has commits, and the commits are the history. (Compare with, e.g., ClearCase, which does have true file history, with all that this implies.)

In Git, each commit has a list of predecessor or parent commits, and each commit holds a full snapshot of all files. So in your example, there are four commits—or at least four interesting ones. I'll assume here that there are a total of five commits and that we can draw those commits like this:

A <-B <-C <-D <-E   <--master

The name master holds the actual hash ID of the last commit E. That commit contains files. It also contains the raw hash ID of its parent commit D.

For simplicity, let's say all the commits contain one other file, named README.md. Commit A consists only of this README.md, i.e., if we git checkout commit A, we'll get a work-tree with just one file, README.md.

In commit B, you added a file named testfile.txt. You did this with:

... create the file ...
git add testfile.txt
git commit -m "second commit"

This made commit B, which pointed back to existing commit A. Commit B now contains two files: README.md, unchanged from commit A (and actually re-used in Git's internal stored format), and testfile.txt.

You then modified the work-tree copy of testfile.txt, used git add again, and ran git commit to create commit C. Commit C now points back to commit B; commit C contains both README.md (still unchanged) and a new version of testfile.txt.

At this point, you ran:

mkdir -p sub/module/path
git mv testfile.txt sub/module/path/testfile.txt
git commit -m "fourth commit"

(or anything equivalent) to make commit D, which points back to C. Commit D contains two files: README.md (still unchanged), and sub/module/path/testfile.txt: a file with a long name with slashes in it. The contents of the second file are the same as the contents of the shorter-named file in commit C, but the name is different.

Last, you modified the work-tree file named testfile.txt in the work-tree directory/folder named sub/module/path, used git add on it, and ran git commit to make commit E. E points back to D and contains two files.

Given this history—this series of commits—you now tell Git:

Use the name master to find the last commit.
For each commit, look at the parent-child pair and see if it modifies the file named sub/module/path/testfile.txt in some way:
- If so, print the name (hash ID) of the child commit, its log message, and maybe also the kind of change to the file.
- If the kind of change is rename, start looking for the old name now.
In any case, move on to the previous commit, if there is one. Stop when you run out of commits.

(That's your git log --follow -- sub/module/path/testfile.txt command.)
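The five-commit history and the rename-following walk described above can be tried on a throwaway repository. The following is only a sketch: the file contents, commit messages, and the user name and email are made up for illustration, and it assumes git is installed.

```shell
#!/bin/sh
set -e

# Build commits A..E in a scratch repository (all contents are invented).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "readme" > README.md
git add README.md
git commit -q -m "A: add README.md"

printf 'line 1\n' > testfile.txt
git add testfile.txt
git commit -q -m "B: create testfile.txt"

printf 'line 1\nline 2\n' > testfile.txt
git add testfile.txt
git commit -q -m "C: modify testfile.txt"

mkdir -p sub/module/path
git mv testfile.txt sub/module/path/testfile.txt
git commit -q -m "D: move testfile.txt into sub/module/path"

printf 'line 1\nline 2\nline 3\n' > sub/module/path/testfile.txt
git add sub/module/path/testfile.txt
git commit -q -m "E: modify sub/module/path/testfile.txt"

# Without --follow, the walk only reports commits that touch the new path.
plain=$(git log --format=%s -- sub/module/path/testfile.txt | wc -l | tr -d ' ')
# With --follow, the walk switches to the old name at the rename in D.
followed=$(git log --follow --format=%s -- sub/module/path/testfile.txt | wc -l | tr -d ' ')

echo "commits listed without --follow: $plain"
echo "commits listed with --follow:    $followed"
```

The difference between the two counts is the part of the history that a plain path-based view, and likewise the subdirectory filter, does not see.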

You're now converting to a submodule. A submodule is a Git repository.

Each future submodule git checkout set of files will reside within the sub/module/path sub-directory of the superproject work-tree, so that if the submodule contains a commit that contains a file named testfile.txt, that file will appear in sub/module/path/testfile.txt. If the submodule contains a commit that contains a file named sub/module/path/testfile.txt, that file will appear in sub/module/path/sub/module/path/testfile.txt, which is not what you want.

Your job is therefore to make a series of commits that are a new repository. In this series of commits, the file will just be named testfile.txt. This new repository will probably have all-new commits: in this case, none of the hash IDs in this new repository will match any of the hash IDs in the original repository.

It is your choice as to whether to keep some or all of the files from original commit B and, if so, what to do about the fact that, in commit B, the file you care about is named testfile.txt rather than sub/module/path/testfile.txt. Similarly, you can keep some or all of the original files from commit C.

In any case you'll keep parts of commits D and E in a sort of easier way: just throw out everything that's not sub/module/path/ and strip the sub/module/path/ part from the file names.

If you do keep some or all of (the files from) commits B and/or C, the testfile.txt in the two kept commits has to be named testfile.txt so that it lands in the right place. The strip-leading-sub/module/path/ trick automatically gives the right names for the remaining commits.

The transformation command you use to copy the original series of commits to a new series of commits can be git filter-branch with its --subdirectory-filter. But the subdirectory filter cannot keep these parts of commits B and C for you. Essentially, git filter-branch and its subdirectory filter are just not that smart. What filter-branch does for you is this:

Start with the first commit, and go in a forwards direction (this is rare in Git because Git is bad at this: Git strongly prefers to work backwards).
For each commit:
- apply some filter(s);
- use the result to make a new commit, or to skip the commit entirely;
- automatically link new commits to the new chains of commits being made, ie, substitute in the correct backwards-pointing linkages.
Repeat for all commits leading to the chosen branch names, or all branch names, as ending commits.
Last, store the final filtered commit hash ID in each branch name.

If your input series of commits is:

A--B--C--G   <-- branch1
       \
        D--E--F   <-- branch2

and your filter keeps B (with some changes made) and all subsequent commits (perhaps with other changes made as well), the final result is:

A--B--C--G   [abandoned]
       \
        D--E--F   [abandoned]

B'-C'-G'  <-- branch1
    \
     D'-E'-F'  <-- branch2

Now, working the way Git normally does, starting at the name branch1 and working backwards, we see the copied-and-filtered commits B'-C'-G' (in the other order), and working from branch2 we see B'-C'-D'-E'-F' (in the other order). So git filter-branch has now done its job. If we push the new commit-chains and two names to a new repository, we have a repository that no longer has commit A at all.

(Note that all the original commits still exist. We just can't see them. If we clone this filtered clone again, they all drop out and are really gone. Or, we can remove the breadcrumb trail that filter-branch leaves behind in case you want to undo the effect, and Git will eventually clean out the original commits.)

The --subdirectory-filter in filter-branch works by throwing away all files not in the chosen subdirectory prefix, and renaming the remaining files to strip off the chosen prefix. If the result of throwing out all but these files is "no files at all" or "the same as the previously stripped-down commit", the commit itself gets thrown out as well. But this tosses out the copy of testfile.txt that wasn't in the subdirectory.

Normally, that's what one wants, because the original repository still exists and still has that file in the "pre-submodule" commits. You are not changing those commits; in fact, you can't change any commit, ever. That's why Git is doing all this copying: it literally must. The best we can get is new commits forming a new history that we (and Git) find by starting at updated names and working backwards, as Git does.

But it's not what you want. It might be enough—it might be all you really need—in which case the existing subdirectory filter will work for you.

filter-branch does have a general purpose "arbitrary script" option

Among the filters that git filter-branch supports, these two let you run an arbitrary command against the files of each commit:

--index-filter
--tree-filter

Both of these take a command-line-style command to run. This command can use any program you write in any language, or just be a series of shell commands. The key difference between these two is how they run your command—the environment in which the commands operate.

(You can instead use the new git filter-repo command, which is written in Python and does the same kind of thing filter-branch does, but lets you execute Python functions. I do not have any examples of how to use it though, and it's not built in to Git yet: you have to install it separately.)

The index filter is much faster, but also much harder to write. To understand how to use it, it helps to understand the tree filter first.

The tree filter is simple to use. What filter-branch does, before running your tree filter command, is extract the entire snapshot to a temporary directory-tree.

(This temporary tree is not your work-tree. Do not expect it to be your work-tree. It is in a temporary sub-directory, hidden somewhere you won't expect. Assume nothing about it, except that it has all your files, and only your files, from that commit, extracted into folders however your OS requires.)

Your command's job is now: do whatever you like to those files. You can edit them in place, rename them, change their permissions to add or remove the "executable" flag (chmod them), and so on. Whatever files you leave in this tree will go into the replacement commit that filter-branch will make. So you can rename files, and remove files. For instance, you could check to see if testfile.txt exists at the top level, and if so, leave it in place. You could remove all other files that aren't in sub/module/path and then move all the sub/module/path files to the top level. That would very likely be what you want in the new replacement commit here.

Then, having done all of this, your command should finish up with a successful status. If you write a program to do the work, use the OS-level exit(0) function. If this is a shell script, such as /tmp/shuffle-the-files.sh , have it exit with status zero.

The tree-filter will now say to itself: Ah, command succeeded; I now make a new commit from exactly the set of files that remain in the hidden temporary directory.

The filter-branch code will repeat this process for every commit in the chains to be copied. This can take a long time: hours, or days. But eventually, you have the new commits, made by copying the originals, and git filter-branch updates the branch names as described.
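As a rough illustration of the tree-filter command described above, here is a minimal sketch. The function name shuffle_files and the variables subdir and keep are made up for this example, and the demonstration runs against a scratch directory rather than a real filter-branch run. It assumes a find that supports -mindepth and -maxdepth (GNU and BSD find both do).

```shell
#!/bin/sh
# Sketch only: shuffle_files, subdir and keep are illustrative names.
set -e

subdir=sub/module/path   # directory that becomes the submodule's top level
keep=testfile.txt        # pre-rename file to preserve from older commits

# Rearrange the current directory: keep $keep and the contents of $subdir,
# delete everything else, and leave the survivors at the top level.
shuffle_files() {
    stage=$(mktemp -d)
    if [ -f "$keep" ]; then
        mv "$keep" "$stage/"
    fi
    if [ -d "$subdir" ]; then
        # Move every entry (dotfiles included) out of the subdirectory.
        find "$subdir" -mindepth 1 -maxdepth 1 -exec mv {} "$stage/" \;
    fi
    # Remove whatever is left, then bring the staged files back.
    find . -mindepth 1 -maxdepth 1 -exec rm -rf {} +
    find "$stage" -mindepth 1 -maxdepth 1 -exec mv {} . \;
    rmdir "$stage"
}

# Demonstration on a scratch directory with invented files.
demo=$(mktemp -d)
mkdir -p "$demo/sub/module/path" "$demo/other"
echo readme > "$demo/README.md"
echo old    > "$demo/testfile.txt"
echo new    > "$demo/sub/module/path/newfile.txt"
echo junk   > "$demo/other/junk.txt"
(cd "$demo" && shuffle_files)
ls -A "$demo"
```

To try this as a real tree filter, one could save the function in a script such as the /tmp/shuffle-the-files.sh mentioned earlier, replace the demonstration with a bare call to shuffle_files, and run git filter-branch --tree-filter 'sh /tmp/shuffle-the-files.sh' -- --all on a disposable clone. Adding --prune-empty asks filter-branch to drop commits that no longer change anything. If a commit has testfile.txt both at the top level and in sub/module/path, this sketch lets the sub/module/path copy win, because it is staged second.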

The index filter is the same as the tree filter, but instead of:

extract entire snapshot to temporary area
run arbitrary command
turn temporary area into new snapshot

the index filter makes use of Git's index. Git reads the commit to copy into its index (just as it would for a regular git checkout, really), which is very fast. Then it runs your command. Your command's job is to update the index in place. You can remove or rename files within the index, and then exit zero. Git then makes the new replacement commit from whatever is in the index, which is very fast. So the index filter is generally hundreds of times faster than the tree filter.

Unfortunately, the only "rename a file in the index" tool that exists in standard Git is git mv, and it demands that the file exist in the work-tree as well, which it won't. So, to use the index filter, you'll have to do some fancy git update-index work, which probably means writing a program. If you only have a few hundred commits, or even a few thousand, you are probably better off with the tree filter, which is much easier to use.
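One possible shape for that git update-index work is to rewrite the textual index listing rather than the files. The sketch below shows only the rewriting step, as a function with the made-up name rewrite_index_listing. It assumes input in the format printed by git ls-files -s (mode, object hash, stage, a tab, then the path), assumes paths contain no characters that Git would quote, and uses invented object hashes for the demonstration.

```shell
#!/bin/sh
# rewrite_index_listing is an illustrative name, not a Git command.
# It reads "git ls-files -s" style lines on stdin and writes the listing
# for the replacement commit on stdout.
rewrite_index_listing() {
    awk -F '\t' -v prefix="sub/module/path/" -v keep="testfile.txt" '
        # Entry under the prefix: keep it, with the prefix stripped.
        index($2, prefix) == 1 {
            print $1 "\t" substr($2, length(prefix) + 1)
            next
        }
        # Pre-rename top-level file: keep it unchanged.
        $2 == keep { print }
        # Every other entry is dropped by printing nothing.
    '
}

# Demonstration with made-up object hashes, shaped like commit D.
listing=$(printf '%s\t%s\n' \
    '100644 1111111111111111111111111111111111111111 0' 'README.md' \
    '100644 2222222222222222222222222222222222222222 0' 'sub/module/path/testfile.txt' \
  | rewrite_index_listing)
printf '%s\n' "$listing"

# Same function on a listing shaped like commit B or C.
old=$(printf '%s\t%s\n' \
    '100644 1111111111111111111111111111111111111111 0' 'README.md' \
    '100644 3333333333333333333333333333333333333333 0' 'testfile.txt' \
  | rewrite_index_listing)
printf '%s\n' "$old"
```

In an actual --index-filter, the rewritten listing would be fed to git update-index --index-info with GIT_INDEX_FILE pointing at a fresh index file, which is then moved over the original; the git filter-branch manual uses that pattern in its example of moving a whole tree into a subdirectory. Two caveats apply to this sketch: the function would have to live in a script file, since filter-branch does not see functions defined in your interactive shell, and a commit whose rewritten listing is empty (such as commit A) needs separate handling, because no new index file gets written for it.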

(The general slowness and difficulty-of-use of git filter-branch is why it is being phased out now in favor of git filter-repo .)
