Switching a Git repository from ISO-8859-1 to UTF-8 encoding for source code files

I'm going to convert a large Mercurial project to Git this weekend using fast-export. I've tested that several times, and the results are good.

We'd also like to switch our source code encoding (lots of German comments and string literals with umlauts) from ISO-8859-1 to UTF-8 (all non-Java files in the repo should stay as-is), and the Git migration gives us a chance to do it now, since everybody needs to clone again anyway. However, I haven't found a good approach for it.

  1. I've tried the git filter-branch --tree-filter ... approach from this comment on SO. While this seems ideal, the size of the repository (about 200,000 commits, 18,000 code files) means it would take much more time than the weekend I have. I ran it in a heavily optimized version, where the list of files is chunked and the sublists are converted in parallel with GNU parallel, straight from a 64 GB tmpfs volume on a Linux VM with 72 cores, and it would still take several days.
  2. Alternatively, I've tried the simple approach of performing the conversion on each active branch individually and committing the changes. The result is not satisfying, because I then almost always get conflicts when merging or cherry-picking pre-conversion commits.
  3. Now I'm running approach 1 again, but instead of rewriting the complete history of all branches (--all as <rev-list>), I rewrite only the commits that are reachable from the currently active branches and not reachable from some past commit that is (hopefully) a predecessor of all current branches (branch-a branch-b branch-c --not old-tag-before-branch-abc-forked-off as <rev-list>). It's still running, but I fear that I can't really trust the results, as this seems like a very bad idea.
  4. We could just switch the encoding in the master branch with a normal commit, as in approach 2, but again this would make cherry-picking fixes from/to master a disaster. It would also introduce lots of encoding problems, because developers would surely forget to change their IDE settings when switching between master and non-converted branches.

So right now, I feel the best solution might be to just stick with ISO-8859-1.

Does anyone have an idea? Someone mentioned that reposurgeon might be able to do basically approach 1 using its transcode operation, with much better performance than git filter-branch --tree-filter ..., but I have no clue how that works.

A tree filter in git filter-branch is inherently slow. It works by extracting every commit into a full-blown tree in a temporary directory, letting you change every file, and then figuring out what you changed and making the new commit from every file you left behind.

If you're exporting and importing through fast-export / fast-import, that would be the time to convert the data: you have the expanded data of the file in memory, but not in file-system form, before writing it to the export/import pipeline. Moreover, the hg-fast-export.sh wrapper is a shell script that feeds git fast-import, so it's trivial to insert filtering there, and hg-fast-export itself is a Python program, so it's trivial to insert filtering there as well. The obvious place would be here: just re-encode d.

You might consider using git filter-branch --index-filter as opposed to --tree-filter. The idea is that with --index-filter there's no checkout step (i.e. the work tree is not (re-)populated at all on each iteration).

So you might consider writing a filter for git filter-branch --index-filter which would use git ls-files, something like this:

  1. Call git ls-files --cached --stage and iterate over each entry.

    Consider only those which have the 100644 file mode, that is, normal files.

  2. For each entry, run something like

     sha1=`git show ":0:$filename" \
         | iconv -f iso8859-1 -t utf-8 \
         | git hash-object -t blob -w --stdin`
     git update-index --cacheinfo "100644,$sha1,$filename" --info-only

  3. Rinse, repeat.
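
The steps above can be sketched as a single shell function. This is only a rough illustration under some assumptions that are not in the original: the function name reencode_java_blobs is made up, only paths ending in .java are touched, and every blob that does not already decode as UTF-8 is assumed to be ISO-8859-1.

```shell
# Sketch of an index filter body: re-encode regular *.java blobs in the index.
# Assumption: any *.java blob that is not valid UTF-8 is ISO-8859-1.
reencode_java_blobs() (
    tab=$(printf '\t')
    # core.quotePath=false keeps non-ASCII path names unescaped in the listing
    git -c core.quotePath=false ls-files --cached --stage |
    while IFS= read -r entry; do
        # Each entry reads: <mode> <sha1> <stage><TAB><path>
        meta=${entry%%"$tab"*}
        path=${entry#*"$tab"}
        mode=${meta%% *}
        rest=${meta#* }
        sha1=${rest%% *}

        # Only regular, non-executable files with a .java suffix
        [ "$mode" = "100644" ] || continue
        case $path in *.java) ;; *) continue ;; esac

        # Blobs that already decode as UTF-8 (including pure ASCII) stay untouched
        if git cat-file blob "$sha1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
            continue
        fi

        # Write the re-encoded content as a new blob and point the index at it
        new_sha1=$(git cat-file blob "$sha1" |
            iconv -f ISO-8859-1 -t UTF-8 |
            git hash-object -t blob -w --stdin) || continue
        git update-index --cacheinfo "$mode,$new_sha1,$path"
    done
)
```

Since the function is plain POSIX sh, it could be sourced inside the filter, for example git filter-branch --index-filter '. /path/to/reencode.sh && reencode_java_blobs' -- <rev-list>, where /path/to/reencode.sh is a hypothetical absolute path to a file holding the function. Note that this sketch re-checks every blob in every commit, so caching the mapping from old to new blob IDs would probably be needed on a repository of this size.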

Another approach I can think of would be to attack the problem from a different angle: the streams generated by git fast-export and consumed by git fast-import are plain text¹ (just pipe your exporter's output to less or another pager and see for yourself).

You could write a filter in your favourite programming language which would parse the stream and re-encode any data chunks. The stream is organized so that no SHA-1 hashes are used, so you may re-encode as you go. The only apparent problem I see is that the data chunks bear no information about which file they will represent in the resulting commit (if any). So if you have non-text files in your history, you might need to either resort to guessing based on the contents of each data blob, or make your processor more complicated by remembering the blobs it has seen and deciding which of them to re-encode after it sees the commit record which assigns file names to (some of) those blobs.
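
One detail such a filter has to respect: in the stream, each blob payload is introduced by a data <count> header giving the exact payload size in bytes, and re-encoding ISO-8859-1 to UTF-8 changes that size, so the count must be recomputed. A minimal sketch of just that step, assuming the payload has already been extracted to a file (the function name emit_reencoded_data and the temp-file approach are illustrative only, and this is not a full stream parser):

```shell
# Print a fast-import "data" command whose payload is the ISO-8859-1 file $1
# re-encoded as UTF-8, with the byte count recomputed for the new payload.
emit_reencoded_data() (
    tmp=$(mktemp) || exit 1
    trap 'rm -f "$tmp"' EXIT
    iconv -f ISO-8859-1 -t UTF-8 "$1" > "$tmp" || exit 1
    # wc -c counts bytes; strip the padding some implementations add
    size=$(wc -c < "$tmp" | tr -d ' ')
    printf 'data %s\n' "$size"
    cat "$tmp"
    # The stream format allows an optional LF after the raw payload
    printf '\n'
)
```

For a five-byte ISO-8859-1 input such as "grün" plus a newline, the header becomes data 6, because the ü takes two bytes in UTF-8.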


¹ Documented in git-fast-import(1); run git help fast-import.

I had the exact same problem. My solution is based on @kostix's answer, using the --index-filter option of filter-branch as the basis, but with some additional improvements.

  1. Use git diff --name-only --staged to detect the contents of the staging area
  2. Iterate over this list and filter for:
    1. git ls-files $filename returns the file, i.e., it isn't a deleted file
    2. the result of git show ":0:$filename" | file - --brief --mime-encoding isn't binary (i.e., it is a text file) and isn't already UTF-8 encoded
  3. Use the detected mime encoding for each file
  4. Use iconv to convert the files
  5. Detect the file mode with git ls-files $filename --stage | cut -c 1-6

This is what my bash function looks like:

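changeencoding() {
    local filename encoding sha1 mode
    # Paths that differ between the index and HEAD, i.e. touched by this commit
    git diff --name-only --staged | while IFS= read -r filename; do
        # Only if the file is still present, i.e., filter out deletions
        if [ -n "$(git ls-files -- "$filename")" ]; then
            encoding=$(git show ":0:$filename" | file - --brief --mime-encoding)
            # Skip binary files and files that are already UTF-8
            if [ "$encoding" != "binary" ] && [ "$encoding" != "utf-8" ]; then
                # Store the re-encoded content as a new blob
                sha1=$(git show ":0:$filename" \
                    | iconv --from-code="$encoding" --to-code=utf-8 \
                    | git hash-object -t blob -w --stdin)
                # Keep the original file mode (first six characters of the entry)
                mode=$(git ls-files --stage -- "$filename" | cut -c 1-6)
                git update-index --cacheinfo "$mode,$sha1,$filename" --info-only
            fi
        fi
    done
}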
