Quantifying the amount of change in a git diff?

I use git for a slightly unusual purpose: it stores my text as I write fiction. (I know, I know... geeky.)

I am trying to keep track of productivity, and want to measure the degree of difference between subsequent commits. The writer's proxy for "work" is "words written", at least during the creation stage. I can't use straight word count as it ignores editing and compression, both vital parts of writing. I think I want to track:

 (words added)+(words removed)

which will double-count (words changed), but I'm okay with that.

It'd be great to type some magic incantation and have git report this distance metric for any two revisions. However, git diffs are patches, which show entire lines even if you've only twiddled one character on the line; I don't want that, especially since my 'lines' are paragraphs. Ideally I'd even be able to specify what I mean by "word" (though \W+ would probably be acceptable).

Is there a flag to git-diff to give diffs on a word-by-word basis? Alternately, is there a solution using standard command-line tools to compute the metric above?
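
For reference, the metric described above, (words added)+(words removed), can be sketched outside git. This is only a rough illustration using Python's standard difflib module; the function name word_change_count and the default word pattern \w+ (the complement of splitting on \W+) are assumptions made for the example, not anything git provides.

```python
import difflib
import re


def word_change_count(old_text, new_text, word_pattern=r"\w+"):
    """Return (added, removed) word counts between two versions of a text."""
    old_words = re.findall(word_pattern, old_text)
    new_words = re.findall(word_pattern, new_text)
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    added = removed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # A "replace" counts on both sides, so changed words are double-counted.
        if tag in ("replace", "delete"):
            removed += i2 - i1
        if tag in ("replace", "insert"):
            added += j2 - j1
    return added, removed


old = "The quick brown fox jumps over the lazy dog."
new = "The quick red fox leaps over the dog."
added, removed = word_change_count(old, new)
print(added, removed, added + removed)  # 2 3 5
```

Passing a different word_pattern changes what counts as a "word", which is the kind of control asked for above.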

Building on James' and cornmacrelf's input, I've added arithmetic expansion and came up with a few reusable alias commands for counting words added, deleted, and duplicated in a git diff:

alias gitwa='git diff --word-diff=porcelain origin/master | grep -e "^+[^+]" | wc -w | xargs'
alias gitwd='git diff --word-diff=porcelain origin/master | grep -e "^-[^-]" | wc -w | xargs'
alias gitwdd='git diff --word-diff=porcelain origin/master | grep -e "^+[^+]" -e "^-[^-]" | sed -e "s/.//" | sort | uniq -d | wc -w | xargs'

alias gitw='echo $(($(gitwa) - $(gitwd)))'

Output from gitwa and gitwd is trimmed using the xargs trick.

The duplicated-words count is taken from Miles' answer.

wdiff does word-by-word comparison. Git can be configured to use an external program to do the diffing. Based on those two facts and this blog post, the following should do roughly what you want.

Create a script that ignores most of the arguments git-diff provides and passes only the two file paths to wdiff. Save the following as ~/wdiff.py or something similar and make it executable.

#!/usr/bin/python

import sys
import os

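# git calls an external diff with: path old-file old-hex old-mode new-file new-hex new-mode
# so argv[2] is the old version and argv[5] is the new one.
os.system('wdiff -s3 "%s" "%s"' % (sys.argv[2], sys.argv[5]))

Tell git to use it.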

git config --global diff.external ~/wdiff.py
git diff filename

git diff --word-diff works in the latest stable version of git (at git-scm.com).

There are a few options that let you decide what format you want it in. The default is quite readable, but you might want --word-diff=porcelain if you're feeding the output into a script.
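
As a rough sketch of what "feeding the output into a script" could look like, the Python below counts words on the added and removed lines of porcelain output. The function name count_word_diff and the sample text are made up for illustration; the sample only imitates the porcelain layout (a one-character prefix of +, - or space per chunk, and ~ for a newline).

```python
def count_word_diff(porcelain_text):
    """Return (added, removed) word counts from --word-diff=porcelain text."""
    added = removed = 0
    for line in porcelain_text.splitlines():
        # Skip the "--- a/file" and "+++ b/file" file headers.
        if line.startswith(("+++", "---")):
            continue
        if line.startswith("+"):
            added += len(line[1:].split())
        elif line.startswith("-"):
            removed += len(line[1:].split())
    return added, removed


# Hand-written sample in the porcelain layout (file name and hashes are made up).
sample = "\n".join([
    "diff --git a/chapter1.txt b/chapter1.txt",
    "index 1111111..2222222 100644",
    "--- a/chapter1.txt",
    "+++ b/chapter1.txt",
    "@@ -1 +1 @@",
    " It was a",
    "-dark and stormy",
    "+bright, quiet",
    "  night.",
    "~",
])

print(count_word_diff(sample))  # (2, 3)
```

In real use the text would come from running git diff --word-diff=porcelain, for example through subprocess.run with capture_output=True and text=True.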

I figured out a way to get concrete numbers by building on top of the other answers here. The result is an approximation, but it should be close enough to serve as a useful indicator of the number of characters that were added or removed. Here's an example with my current branch compared to origin/master:

$ git diff --word-diff=porcelain origin/master | grep -e '^+[^+]' | wc -m
38741
$ git diff --word-diff=porcelain origin/master | grep -e '^-[^-]' | wc -m
46664

The difference between the removed characters (46664) and the added characters (38741) shows that my current branch has removed approximately 7923 characters. Those individual added/removed counts are inflated by the diff's + / - and indentation characters; however, the subtraction should cancel out a significant portion of that inflation in most cases.

Git has had (for a long time) a --color-words option for git diff. This doesn't get you your counting, but it does let you see the diffs.

scompt.com's suggestion of wdiff is also good; it's pretty easy to shove in a different differ (see git-difftool). From there you just have to go from the output wdiff can give to the result you really want.

There's one more exciting thing to share, though, from git's what's cooking:

* tr/word-diff (2010-04-14) 1 commit
  (merged to 'next' on 2010-05-04 at d191b25)
 + diff: add --word-diff option that generalizes --color-words

Here's the commit introducing word-diff. Presumably it will make its way from next into master before long, and then git will be able to do this all internally, either producing its own word diff format or something similar to wdiff. If you're daring, you could build git from next, or just merge that one commit into your local master to build.

Thanks to Jakub's comment: you can further customize word diffs if necessary by providing a word regex (config parameter diff.*.wordRegex), documented in gitattributes.

I liked Stoutie's answer and wanted to make it a bit more configurable to answer some word count questions I had. Ended up with the following solution that works in ZSH and should work in Bash. Each function takes any revision or revision difference, with a default of comparing the current state of the world with origin/master:


# Calculate writing word diff between revisions. Cribbed / modified from:
# https://stackoverflow.com/questions/2874318/quantifying-the-amount-of-change-in-a-git-diff
function git_words_added {
  revision=${1:-origin/master}

  git diff --word-diff=porcelain $revision | \
    grep -e "^+[^+]" | \
    wc -w | \
    xargs
}

function git_words_removed {
  revision=${1:-origin/master}

  git diff --word-diff=porcelain $revision | \
    grep -e "^-[^-]" | \
    wc -w | \
    xargs
}

function git_words_diff {
  revision=${1:-origin/master}

  echo $(($(git_words_added "$revision") - $(git_words_removed "$revision")))
}

Then you can use it like so:


$ git_words_added
# => how many words were added since origin/master

$ git_words_removed
# => how many words were removed since origin/master

$ git_words_diff
# => difference of adds and removes since origin/master (net words)

$ git_words_diff HEAD
# => net words since you last committed

$ git_words_diff master@{yesterday}
# => net words written today!

$ git_words_diff HEAD^..HEAD
# => net words in the last commit

$ git_words_diff ABC123..DEF456
# => net words between two arbitrary commits

Hope this helps someone!

The above answers fail for some use cases where you need to exclude moved text (e.g., if I move a function in code or a paragraph in LaTeX further down the document, I don't want to count all of those as changes!).

For that, you can also calculate the number of duplicate lines, and exclude those from your query if there are too many duplicates.

For example, building on the other answers, I can do:

git diff $sha~1..$sha|grep -e"^+[^+]" -e"^-[^-]"|sed -e's/.//'|sort|uniq -d|wc -w|xargs

This calculates the number of duplicate words in the diff, where sha is your commit.

You can do this for all the commits within the last day (since 6 am) by:

for sha in $(git rev-list --since="6am" master | sed -e '$ d'); do
     echo $(git diff --word-diff=porcelain $sha~1..$sha|grep -e"^+[^+]"|wc -w|xargs),\
     $(git diff --word-diff=porcelain $sha~1..$sha|grep -e"^-[^-]"|wc -w|xargs),\
     $(git diff $sha~1..$sha|grep -e"^+[^+]" -e"^-[^-]"|sed -e's/.//'|sort|uniq -d|wc -w|xargs)
done

Prints: added, deleted, duplicates

(I take the line diff for duplicates, as it excludes the times where git diff tries to be too clever, and assumes you have actually just changed text rather than moved it. It also discounts instances where a single word is counted as a duplicate.)

Or, if you want to be sophisticated about it, you can exclude commits entirely if there is more than 80% duplication, and sum up the rest:
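
To make the duplicate check concrete, here is a loose Python analogue of the grep | sed | sort | uniq -d | wc -w pipeline, operating on the text of a plain line diff. The function name duplicated_words and the sample diff are invented for illustration, and it is a sketch rather than an exact port of the shell command.

```python
from collections import Counter


def duplicated_words(diff_text):
    """Count words on changed lines whose text occurs more than once in the diff."""
    changed = [
        line[1:]  # drop the leading + or -, like sed 's/.//'
        for line in diff_text.splitlines()
        if (line.startswith("+") and not line.startswith("++"))
        or (line.startswith("-") and not line.startswith("--"))
    ]
    counts = Counter(changed)
    # Like uniq -d, each repeated line contributes its words once.
    return sum(len(text.split()) for text, n in counts.items() if n > 1)


# Made-up diff: one paragraph moved unchanged, one line genuinely rewritten.
sample = "\n".join([
    "--- a/paper.tex",
    "+++ b/paper.tex",
    "@@ -1,3 +1,3 @@",
    "-This paragraph moved.",
    "-old wording here",
    "+new wording here",
    "+This paragraph moved.",
])

print(duplicated_words(sample))  # 3
```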

total=0
for sha in $(git rev-list --since="6am" master | sed -e '$ d'); do
    added=$(git diff --word-diff=porcelain $sha~1..$sha|grep -e"^+[^+]"|wc -w|xargs)
    deleted=$(git diff --word-diff=porcelain $sha~1..$sha|grep -e"^-[^-]"|wc -w|xargs)
    duplicated=$(git diff $sha~1..$sha|grep -e"^+[^+]" -e"^-[^-]"|sed -e's/.//'|sort|uniq -d|wc -w|xargs)
    if [ "$added" -eq "0" ]; then
        changed=$deleted
        total=$((total+deleted))
        echo "added:" $added, "deleted:" $deleted, "duplicated:"\
             $duplicated, "changes counted:" $changed
    elif [ "$(echo "$duplicated/$added > 0.8" | bc -l)" -eq "1" ]; then
        echo "added:" $added, "deleted:" $deleted, "duplicated:"\
             $duplicated, "changes counted:" 0
    else
        changed=$((added+deleted))
        total=$((total+changed))
        echo "added:" $added, "deleted:" $deleted, "duplicated:"\
             $duplicated, "changes counted:" $changed
    fi
done
echo "Total changed:" $total

I have this script to do it here: https://github.com/MilesCranmer/git-stats.

This prints out:

➜  bifrost_paper git:(master) ✗ count_changed_words "6am" 

added: 38, deleted: 76, duplicated: 3, changes counted: 114
added: 14, deleted: 19, duplicated: 0, changes counted: 33
added: 1113, deleted: 1112, duplicated: 1106, changes counted: 0
added: 1265, deleted: 1275, duplicated: 1225, changes counted: 0
added: 4207, deleted: 4208, duplicated: 4391, changes counted: 0
Total changed: 147

The commits where I am just moving around things are obvious, so I don't count those changes. It counts up everything else and tells me the total number of changed words.

Sorry, I don't have enough reputation points to comment on @codebeard's answer. It is the one I used, and I added both of his versions to my .gitconfig file. They gave different answers, and I traced the problem to wdiff -sd in the second version (the one that combines all modified files together) counting the words in the two lines at the top of the output of diff -pdrU3. It will be something like:
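
The counting rule can be checked against the numbers printed above. This is a small Python restatement of the shell logic, with counted_changes as a made-up name and the 0.8 threshold taken from the script.

```python
def counted_changes(added, deleted, duplicated, threshold=0.8):
    """Apply the 80% duplication rule to one commit's word counts."""
    if added == 0:
        return deleted
    if duplicated / added > threshold:
        return 0  # mostly moved text, so the commit is skipped
    return added + deleted


# (added, deleted, duplicated) for the five commits in the output above.
commits = [
    (38, 76, 3),
    (14, 19, 0),
    (1113, 1112, 1106),
    (1265, 1275, 1225),
    (4207, 4208, 4391),
]

total = sum(counted_changes(*commit) for commit in commits)
print("Total changed:", total)  # Total changed: 147
```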

--- 1   2018-12-10 22:53:47.838902415 -0800
+++ 2   2018-12-10 22:53:57.674835179 -0800

I fixed this by piping through tail -n +4.

Here's my full .gitconfig settings with the fix in place:

[alias]
    wdiff = diff
    wdiffs = difftool -t wdiffs
    wdiffs-all = difftool -d -t wdiffs-all
[difftool "wdiffs"]
    cmd = wdiff -n -s \"$LOCAL\" \"$REMOTE\" | colordiff
[difftool "wdiffs-all"]
    cmd = diff -pdrU3 \"$LOCAL\" \"$REMOTE\" | tail -n +4 | wdiff -sd

If you'd rather use git config, here are the commands:

git config --global difftool.wdiffs.cmd 'wdiff -n -s "$LOCAL" "$REMOTE" | colordiff'
git config --global alias.wdiffs 'difftool -t wdiffs'
git config --global difftool.wdiffs-all.cmd 'diff -pdrU3 "$LOCAL" "$REMOTE" | tail -n +4 | wdiff -sd'
git config --global alias.wdiffs-all 'difftool -d -t wdiffs-all'

Now, you can do git wdiffs or git wdiffs-all to get your word count since the last commit.

To compare to origin/master, do git wdiffs origin/master or git wdiffs-all origin/master.

I like this answer the best because it gives both the word count and the diff, and if you pipe through colordiff, it comes out nice and colored. (@Miles' answer is also good but requires you to figure out what time to use. However, I like the idea of looking for moved text.)

wdiff's stats output at the end looks like this:

file1.txt: 12360 words  12360 100% common  0 0% deleted  5 0% changed
file2.txt: 12544 words  12360 99% common  184 1% inserted  11 0% changed

To find out how many words you have added, add the inserted and changed counts from the second line: 184+11 in the example above.

Why not anything from the first line? Answer: those are words removed.

Here's a bash script to get a single, unified word count:
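
The same arithmetic can be sketched in Python by parsing one stats line. The helper name parse_wdiff_stats is hypothetical, and the regular expression assumes the "count percent% label" layout shown in the example output above.

```python
import re

# Each statistic appears as "<count> <percent>% <label>".
STATS = re.compile(r"(\d+) \d+% (common|deleted|inserted|changed)")


def parse_wdiff_stats(line):
    """Map each label on a wdiff -s summary line to its word count."""
    return {label: int(count) for count, label in STATS.findall(line)}


line = "file2.txt: 12544 words  12360 99% common  184 1% inserted  11 0% changed"
stats = parse_wdiff_stats(line)
print(stats["inserted"] + stats["changed"])  # 195
```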

wdiffoutput=$(git wdiffs-all | tail -n 1)
# wdiff prints each count before its label, so the number that follows
# "common" is the inserted count and the one after "inserted" is the changed count.
wdiffins=$(echo "$wdiffoutput" | grep -oP "common *\K\d*")
wdiffchg=$(echo "$wdiffoutput" | grep -oP "inserted *\K\d*")
echo "Word Count: $((wdiffins+wdiffchg))"

Since Git 1.6.3 there is also git difftool, which can be configured to run nearly any external diff tool. This is a lot easier than some of the solutions which require creating scripts etc. If you like the output of wdiff -s you can configure something like:

git config --global difftool.wdiffs.cmd 'wdiff -s "$LOCAL" "$REMOTE"'
git config --global alias.wdiffs 'difftool -t wdiffs'

Now you can just run git difftool -t wdiffs or its alias git wdiffs.

If you prefer to get statistics for all modified files together, instead do something like:

git config --global difftool.wdiffs.cmd 'diff -pdrU3 "$LOCAL" "$REMOTE" | wdiff -sd'
git config --global alias.wdiffs 'difftool -d -t wdiffs'

This takes the output of a typical unified diff and pipes it into wdiff with its -d option set to just interpret the input. In contrast, the extra -d argument to difftool in the alias tells git to copy all modified files to a temporary directory before doing the diff.
