Switching a Git repository from ISO-8859-1 to UTF-8 encoding for source code files
I'm going to convert a large Mercurial project to Git this weekend using fast-export. I've tested that several times, and the results are good.
We'd also like to convert our source code encoding (lots of German comments and string literals with umlauts) from ISO-8859-1 to UTF-8 (all non-Java files in the repo should stay as-is), and the Git migration gives us a chance to do it now, since everybody needs to clone again anyway. However, I haven't found a good approach for it.
I've tried approach 1, the git filter-branch --tree-filter ... approach from this comment on SO. While this seems ideal, due to the size of the repository (about 200000 commits, 18000 code files) it would take much more time than just the weekend I have. I've tried running it (in a heavily optimized version where the list of files is chunked and the sublists are converted in parallel using GNU parallel) straight from a 64GB tmpfs volume on a Linux VM with 72 cores, and still it would take several days...
Now I'm running approach 1 again, this time not rewriting the complete history of all branches (--all as <rev-list>) but only the commits reachable from the currently active branches and not reachable from some past commit which is (hopefully) a predecessor of all current branches (branch-a branch-b branch-c --not old-tag-before-branch-abc-forked-off as <rev-list>). It's still running, but I fear that I can't really trust the results, as this seems like a very bad idea. So right now, I somehow feel the best solution could be to just stick to ISO-8859-1.
Does anyone have an idea? Someone mentioned that reposurgeon can perhaps do basically what approach 1 does, using its transcode operation, with much better performance than git filter-branch --tree-filter ..., but I have no clue how that works.
A tree filter in git filter-branch is inherently slow. It works by extracting every commit into a full-blown tree in a temporary directory, letting you change every file, and then figuring out what you changed and making the new commit from every file you left behind.
If you're exporting and importing through fast-export / fast-import, that would be the time to convert the data: you have the expanded data of the file in memory, but not in file-system form, before writing it to the export/import pipeline. Moreover, hg-fast-export is a Python program, so it's easy to insert filtering there (git fast-import itself is a compiled C program, so the exporter is the easier end to hook into). The obvious place would be here: just re-encode d.
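As a rough illustration of re-encoding at export time, the sketch below shows a helper that could be applied to a file's bytes just before they are written to the stream. The function name reencode_if_java, the .java suffix check, and the "already valid UTF-8" test are assumptions made for this example and are not part of hg-fast-export; d is assumed to hold the raw file contents.

```python
def reencode_if_java(path: bytes, d: bytes) -> bytes:
    """Return d converted from ISO-8859-1 to UTF-8 if path is a Java source file.

    Hypothetical helper: the name and the heuristics are illustrative only.
    """
    if not path.endswith(b".java"):
        return d  # all other files stay byte-for-byte as they are
    try:
        # Pure ASCII and already-converted files decode cleanly as UTF-8;
        # leave them untouched so the helper is safe to apply twice.
        d.decode("utf-8")
        return d
    except UnicodeDecodeError:
        # Every byte sequence is valid ISO-8859-1, so this cannot fail.
        return d.decode("iso-8859-1").encode("utf-8")
```

Because the length of the data can change, the conversion has to happen before the exporter computes the byte count it writes in front of the file contents.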
You might consider using git filter-branch --index-filter, as opposed to --tree-filter. The idea is that with --index-filter there's no checkout step (i.e. the work tree is not (re-)populated at all on each iteration).
So you might consider writing a filter for git filter-branch --index-filter which would use git ls-files, something like this:
1. Call git ls-files --cached --stage and iterate over each entry.
2. Consider only those which have the 100644 file mode, that is, are normal files.
3. For each entry, run something like
    sha1=`git show ":0:$filename" \
        | iconv -f iso8859-1 -t utf-8 \
        | git hash-object -t blob -w --stdin`
    git update-index --cacheinfo "100644,$sha1,$filename" --info-only
4. Rinse, repeat.
An alternative approach I can think of is to attack the problem from a different angle: the streams generated by git fast-export and consumed by git fast-import are plain text¹ (just pipe your exporter's output to less or another pager and see for yourself).
You could write a filter in your favourite programming language which would parse the stream and re-encode any data chunks. The stream is organized so that no SHA-1 hashes are used, so you may re-encode as you go. The only apparent problem I see is that the data chunks bear no information about which file they will represent in the resulting commit (if any), so if you have non-text files in your history, you might need to either resort to guessing based on the contents of each data blob, or make your processor more complicated by remembering the blobs it has seen and deciding which of them to re-encode after it sees the commit record which assigns file names to (some of) those blobs.
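A minimal sketch of such a stream filter, taking the "guess from the contents" route, might look like the following. The function names and the heuristic (no NUL bytes and not valid UTF-8 means ISO-8859-1 text) are assumptions for illustration. It only handles the exact-byte-count form of the data command and passes everything else through unchanged. Note that it ignores file names entirely, so it would also convert non-Java text files, and that commit messages are data chunks too (they are left alone as long as they are already valid UTF-8).

```python
import io
from typing import BinaryIO


def looks_like_latin1_text(payload: bytes) -> bool:
    """Heuristic guess: text (no NUL bytes) that is not already valid UTF-8."""
    if b"\0" in payload:
        return False  # treat anything containing NUL bytes as binary
    try:
        payload.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False  # pure ASCII or already UTF-8: nothing to do


def transcode_stream(src: BinaryIO, dst: BinaryIO) -> None:
    """Copy a fast-import stream, re-encoding 'data <count>' payloads."""
    for line in iter(src.readline, b""):
        if line.startswith(b"data ") and not line.startswith(b"data <<"):
            size = int(line[5:].strip())
            payload = src.read(size)  # exactly <count> raw bytes follow
            if looks_like_latin1_text(payload):
                payload = payload.decode("iso-8859-1").encode("utf-8")
            # The byte count must match the (possibly longer) new payload.
            dst.write(b"data %d\n" % len(payload))
            dst.write(payload)
        else:
            dst.write(line)  # every other command passes through untouched


def transcode_bytes(stream: bytes) -> bytes:
    """Convenience wrapper for trying the filter on an in-memory stream."""
    out = io.BytesIO()
    transcode_stream(io.BytesIO(stream), out)
    return out.getvalue()
```

In a pipeline, transcode_stream(sys.stdin.buffer, sys.stdout.buffer) would sit between the exporter and git fast-import. Restricting the conversion to Java files would require the more complicated variant described above, which holds blobs back until a commit record names them.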
¹ Documented in git-fast-import(1); run git help fast-import.
I had the exact same problem, and my solution is based on @kostix's answer of using the --index-filter option of filter-branch as the basis, but with some additional improvements:
- Use git diff --name-only --staged to detect the contents of the staging area.
- Check that git ls-files $filename lists the file, i.e., it isn't a deleted file.
- Check that the result of git show ":0:$filename" | file - --brief --mime-encoding isn't binary, i.e., it is a text file, and isn't already UTF-8 encoded.
- Use git ls-files $filename --stage | cut -c 1-6 to detect the file mode.
This is what my bash function looks like:
changeencoding() {
    # Note: word splitting here means file names containing whitespace are not handled.
    for filename in $(git diff --name-only --staged); do
        # Only if the file is present, i.e., skip deletions
        if [ -n "$(git ls-files -- "$filename")" ]; then
            local encoding=$(git show ":0:$filename" | file - --brief --mime-encoding)
            # Skip binary files and files that are already UTF-8
            if [ "$encoding" != "binary" ] && [ "$encoding" != "utf-8" ]; then
                local sha1=$(git show ":0:$filename" \
                    | iconv --from-code="$encoding" --to-code=utf-8 \
                    | git hash-object -t blob -w --stdin)
                local mode=$(git ls-files --stage -- "$filename" | cut -c 1-6)
                git update-index --cacheinfo "$mode,$sha1,$filename" --info-only
            fi
        fi
    done
}