I'm going to convert a large Mercurial project to Git this weekend using fast-export. I've tested that several times, and the results are good.
We'd also like to convert our source code encoding (lots of German comments and string literals with umlauts) from ISO-8859-1 to UTF-8 (all non-Java files in the repo should stay as-is), and the Git migration gives us a chance to do it now, since everybody needs to clone again anyway. However, I can't find a good approach for it.
I've tried the `git filter-branch --tree-filter ...` approach from this comment on SO. While this seems ideal, due to the size of the repository (about 200000 commits, 18000 code files) it would take much more time than just the weekend I have. I've tried running it (in a heavily optimized version where the list of files is chunked and the sublists are converted in parallel using GNU parallel) straight from a 64GB tmpfs volume on a Linux VM with 72 cores, and it would still take several days.
I've also tried not rewriting the whole history (`--all` as `<rev-list>`) but just all commits reachable from the current active branches and not reachable from some past commit which is (hopefully) a predecessor of all current branches (`branch-a branch-b branch-c --not old-tag-before-branch-abc-forked-off` as `<rev-list>`). It's still running, but I fear that I can't really trust the results, as this seems like a very bad idea. So right now, I feel the best solution could be to just stick with ISO-8859-1.
Does anyone have an idea? Someone mentioned that reposurgeon might be able to do basically the first approach using its `transcode` operation, with much better performance than `git filter-branch --tree-filter ...`, but I have no clue how that works.
A tree filter in `git filter-branch` is inherently slow. It works by extracting every commit into a full-blown tree in a temporary directory, letting you change every file, and then figuring out what you changed and making the new commit from every file you left behind.
If you're exporting and importing through fast-export / fast-import, that would be the time to convert the data: you have the expanded data of the file in memory, but not in file-system form, before writing it to the export/import pipeline. Moreover, `git fast-import` simply reads that stream from standard input, so it's trivial to insert a filter in front of it, and `hg-fast-export` is a Python program, so it's trivial to insert filtering there as well. The obvious place would be here: just re-encode `d`.
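As a rough illustration of that re-encoding step, the helper below converts ISO-8859-1 bytes to UTF-8 for Java sources only and leaves everything else untouched. The name `reencode_java_source` and its `(path, data)` signature are hypothetical and not part of hg-fast-export; where exactly it would be applied to `d` depends on the exporter version you use.

```python
def reencode_java_source(path: bytes, data: bytes) -> bytes:
    """Return data re-encoded from ISO-8859-1 to UTF-8 for .java files.

    Hypothetical helper: the name and signature are assumptions made
    for illustration, not an hg-fast-export API.
    """
    # Leave every non-Java file exactly as it is.
    if not path.lower().endswith(b".java"):
        return data
    # Pure ASCII and already-valid UTF-8 need no change; this also guards
    # against double-encoding a file that was converted earlier in history.
    try:
        data.decode("utf-8")
        return data
    except UnicodeDecodeError:
        pass
    # Every byte sequence is valid ISO-8859-1, so this decode cannot fail.
    return data.decode("iso-8859-1").encode("utf-8")
```

The UTF-8 check is a heuristic: an ISO-8859-1 file whose bytes happen to form valid UTF-8 would be left alone, which is rare for German text but worth keeping in mind.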
You might consider using `git filter-branch --index-filter` as opposed to `--tree-filter`, which checks out every commit. The idea is that with `--index-filter` there's no checkout step (i.e. the worktree is not (re-)populated at all on each iteration).
So you might consider writing a filter for `git filter-branch --index-filter` which would use `git ls-files`, something like this:
1. Call `git ls-files --cached --stage` and iterate over each entry.
2. Consider only those which have the `100644` file mode, that is, are normal files.
3. For each entry run something like

       sha1=`git show ":0:$filename" \
           | iconv -f iso-8859-1 -t utf-8 \
           | git hash-object -t blob -w --stdin`
       git update-index --cacheinfo "100644,$sha1,$filename" --info-only

4. Rinse, repeat.
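To make the first two steps a little more concrete, here is a small Python sketch that parses the output of `git ls-files --cached --stage` and keeps only regular files. It assumes the documented `<mode> <object> <stage><TAB><path>` line format; `regular_files` is an illustrative name, the `.java` restriction follows the question's requirement, and the hashes in the sample are made up.

```python
from typing import Iterator, Tuple

def regular_files(stage_output: str, suffix: str = ".java") -> Iterator[Tuple[str, str]]:
    """Yield (sha1, path) for stage-0 entries with mode 100644 ending in suffix."""
    for line in stage_output.splitlines():
        if not line:
            continue
        # Metadata and path are separated by a single TAB.
        meta, path = line.split("\t", 1)
        mode, sha1, stage = meta.split(" ")
        # 100644 = regular file; 100755 (executable) and 120000 (symlink) are skipped.
        if mode == "100644" and stage == "0" and path.endswith(suffix):
            yield sha1, path

# Made-up sample in the same shape as real `git ls-files --stage` output.
sample = (
    "100644 1111111111111111111111111111111111111111 0\tsrc/Main.java\n"
    "100755 2222222222222222222222222222222222222222 0\tbuild.sh\n"
    "120000 3333333333333333333333333333333333333333 0\tlink.java\n"
    "100644 4444444444444444444444444444444444444444 0\tREADME.txt\n"
)
print(list(regular_files(sample)))
```

Paths containing unusual characters are quoted by git unless `-z` is passed, so a real filter would probably want NUL-separated output instead of line splitting.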
An alternative approach I can think of would be to attack the problem from a different angle: the streams generated by `git fast-export` and consumed by `git fast-import` are plain text¹ (just pipe your exporter's output to `less` or another pager and see for yourself).
You could write a filter in your favourite programming language which would parse the stream and re-encode any `data` chunks. The stream is organized in such a way that no SHA-1 hashes are used, so you may re-encode as you go. The only apparent problem I can see is that the `data` chunks bear no information about which file they will represent in the resulting commit (if any). So if you have non-text files in your history, you might need to either resort to guessing based on the contents of each data blob, or make your processor more complicated by remembering the blobs it has seen and deciding which of them to re-encode after it has seen the `commit` record which assigns file names to (some of) those blobs.
¹ Documented in git-fast-import(1); run `git help fast-import`.
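A minimal sketch of the "remember the blobs, decide after the `commit` record" variant of this stream filter could look as follows. It makes several assumptions: the stream has been saved to a file so it can be read twice, blobs are referenced by marks (`M <mode> :<mark> <path>`, not `inline` data), `data` uses the exact byte count form, paths are unquoted, and every selected blob really is ISO-8859-1. The function names are illustrative.

```python
import io
import re

# "M <mode> :<mark> <path>" lines inside commit records name each blob.
MODIFY = re.compile(rb"M \d+ :(\d+) (.*)\n")

def _data_size(line: bytes) -> int:
    """Byte count of a 'data <n>' header (the delimited form is not handled)."""
    return int(line[5:].decode("ascii"))

def marks_to_convert(src, suffix=b".java"):
    """Pass 1: collect marks of blobs that some commit stores under *suffix."""
    marks = set()
    for line in iter(src.readline, b""):
        if line.startswith(b"data "):
            src.read(_data_size(line))  # skip payload so it is never parsed
            continue
        m = MODIFY.fullmatch(line)
        if m and m.group(2).endswith(suffix):
            marks.add(m.group(1))
    return marks

def reencode(src, dst, marks):
    """Pass 2: copy the stream, converting selected blobs to UTF-8."""
    in_blob, mark = False, None
    for line in iter(src.readline, b""):
        if line == b"blob\n":
            in_blob, mark = True, None
        elif in_blob and line.startswith(b"mark :"):
            mark = line[6:].strip()
        elif line.startswith(b"data "):
            payload = src.read(_data_size(line))
            if in_blob and mark in marks:
                payload = payload.decode("iso-8859-1").encode("utf-8")
            # The byte count changes when umlauts grow from 1 to 2 bytes.
            dst.write(b"data %d\n" % len(payload))
            dst.write(payload)
            in_blob = False
            continue
        dst.write(line)

# Tiny in-memory demonstration; real use would open a saved export file.
demo = io.BytesIO(
    b"blob\nmark :1\ndata 1\n\xfc\n"
    b"commit refs/heads/master\ncommitter A <a@example.com> 0 +0000\n"
    b"data 2\nx\nM 100644 :1 A.java\n\n"
)
selected = marks_to_convert(demo)
demo.seek(0)
result = io.BytesIO()
reencode(demo, result, selected)
```

Commit messages are passed through unchanged here because only `data` chunks that follow a `blob` command are candidates. Two passes keep memory use flat, at the cost of reading the stream twice.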
I had the exact same problem, and my solution is based on @kostix's answer, using the `--index-filter` option of `filter-branch` as the basis, but with some additional improvements:
1. Use `git diff --name-only --staged` to detect the contents of the staging area.
2. Iterate over this list and keep only the files for which:
   - `git ls-files $filename` prints something, i.e. it isn't a deleted file;
   - the result of `git show ":0:$filename" | file - --brief --mime-encoding` isn't `binary` (i.e. it is a text file) and isn't already `utf-8`.
3. Use the detected encoding as the source encoding for `iconv`.
4. Detect the file mode with `git ls-files $filename --stage | cut -c 1-6`.
This is what my bash function looks like:
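    changeencoding() {
        # Walk every path staged in the commit being rewritten.
        git diff --name-only --staged | while IFS= read -r filename; do
            # Only if the file is present, i.e. skip deletions.
            if [ -n "$(git ls-files -- "$filename")" ]; then
                local encoding sha1 mode
                encoding=$(git show ":0:$filename" | file - --brief --mime-encoding)
                # Skip binary files and files that are already UTF-8.
                if [ "$encoding" != "binary" ] && [ "$encoding" != "utf-8" ]; then
                    # Re-encode the staged blob and write it to the object database.
                    sha1=$(git show ":0:$filename" \
                        | iconv --from-code="$encoding" --to-code=utf-8 \
                        | git hash-object -t blob -w --stdin)
                    # Keep the original file mode (first six characters of the entry).
                    mode=$(git ls-files --stage -- "$filename" | cut -c 1-6)
                    git update-index --cacheinfo "$mode,$sha1,$filename" --info-only
                fi
            fi
        done
    }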