
How to find/identify large commits in git history?

I have a 300 MB git repo. The total size of my currently checked-out files is 2 MB, and the total size of the rest of the git repo is 298 MB. This is basically a code-only repo that should not be more than a few MB.

I suspect someone accidentally committed some large files (video, images, etc.), and then removed them... but not from git, so the history still contains useless large files. How can I find the large files in the git history? There are 400+ commits, so going one-by-one is not practical.

NOTE: my question is not about how to remove the file, but how to find it in the first place.

🚀 A blazingly fast shell one-liner 🚀

This shell script displays all blob objects in the repository, sorted from smallest to largest.

For my sample repo, it ran about 100 times faster than the other ones found here. On my trusty Athlon II X4 system, it handles the Linux Kernel repository with its 5.6 million objects in just over a minute.

The Base Script

git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  sed -n 's/^blob //p' |
  sort --numeric-sort --key=2 |
  cut -c 1-12,41- |
  $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest

When you run the code above, you will get nice human-readable output like this:

...
0d99bb931299  530KiB path/to/some-image.jpg
2ba44098e28f   12MiB path/to/hires-image.png
bd1741ddce0d   63MiB path/to/some-video-1080p.mp4

macOS users: Since numfmt is not available on macOS by default, you can either omit the last line and deal with raw byte sizes, or brew install coreutils.

Filtering

To achieve further filtering, insert any of the following lines before the sort line.

To exclude files that are present in HEAD, insert the following line:

grep -vF --file=<(git ls-tree -r HEAD | awk '{print $3}') |

To show only files exceeding a given size (e.g. 1 MiB = 2^20 B), insert the following line:

awk '$2 >= 2^20' |

Output for Computers

To generate output that's more suitable for further processing by computers, omit the last two lines of the base script. They do all the formatting. This will leave you with something like this:

...
0d99bb93129939b72069df14af0d0dbda7eb6dba 542455 path/to/some-image.jpg
2ba44098e28f8f66bac5e21210c2774085d2319b 12446815 path/to/hires-image.png
bd1741ddce0d07b72ccf69ed281e09bf8a2d0b2f 65183843 path/to/some-video-1080p.mp4
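This raw three-column output is convenient for further aggregation. For instance, a sketch (reusing the same pipeline) that sums the total checkout size of every blob ever committed:

```shell
# Sum the logical (checkout) size of all blobs reachable from any ref.
# Column 2 of the filtered batch-check output is the blob size in bytes.
git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  sed -n 's/^blob //p' |
  awk '{ total += $2 } END { print total " bytes in " NR " blobs" }'
```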

Appendix

File Removal

For the actual file removal, check out this SO question on the topic.

Understanding the meaning of the displayed file size

What this script displays is the size each file would have in the working directory. If you want to see how much space a file occupies if not checked out, you can use %(objectsize:disk) instead of %(objectsize). However, mind that this metric also has its caveats, as is mentioned in the documentation.
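As a sketch, the first part of the base script with that substitution looks like this (only the format string changes):

```shell
# Same pipeline as the base script, but reporting each blob's on-disk
# (compressed, possibly delta-compressed) size instead of its checkout size.
git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize:disk) %(rest)' |
  sed -n 's/^blob //p' |
  sort --numeric-sort --key=2
```

Because deltas are computed against other objects in the pack, the on-disk sizes can be surprisingly small for files that changed little between revisions.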

More sophisticated size statistics

Sometimes a list of big files is just not enough to find out what the problem is. You would not spot directories or branches containing humongous numbers of small files, for example.

So if the script here does not cut it for you (and you have a decently recent version of git), look into git-filter-repo --analyze or git rev-list --disk-usage (examples).

I've found a one-liner solution on the ETH Zurich Department of Physics wiki page (close to the end of that page). Just do a git gc to remove stale junk, and then

git rev-list --objects --all \
  | grep "$(git verify-pack -v .git/objects/pack/*.idx \
           | sort -k 3 -n \
           | tail -10 \
           | awk '{print$1}')"

will give you the 10 largest files in the repository.

There's also a lazier solution now available: GitExtensions now has a plugin that does this in the UI (and handles history rewrites as well).

GitExtensions "Find large files" dialog

I've found this script very useful in the past for finding large (and non-obvious) objects in a git repository:


#!/bin/bash
#set -x 
 
# Shows you the largest objects in your repo's pack file.
# Written for osx.
#
# @see https://stubbisms.wordpress.com/2009/07/10/git-script-to-show-largest-pack-objects-and-trim-your-waist-line/
# @author Antony Stubbs
 
# set the internal field separator to line break, so that we can iterate easily over the verify-pack output
IFS=$'\n';
 
# list all objects including their size, sort by size, take top 10
objects=`git verify-pack -v .git/objects/pack/pack-*.idx | grep -v chain | sort -k3nr | head`
 
echo "All sizes are in kB's. The pack column is the size of the object, compressed, inside the pack file."
 
output="size,pack,SHA,location"
allObjects=`git rev-list --all --objects`
for y in $objects
do
    # extract the size and convert it to kB
    size=$((`echo $y | cut -f 5 -d ' '`/1024))
    # extract the compressed (in-pack) size and convert it to kB
    compressedSize=$((`echo $y | cut -f 6 -d ' '`/1024))
    # extract the SHA
    sha=`echo $y | cut -f 1 -d ' '`
    # find the objects location in the repository tree
    other=`echo "${allObjects}" | grep $sha`
    #lineBreak=`echo -e "\n"`
    output="${output}\n${size},${compressedSize},${other}"
done
 
echo -e $output | column -t -s ', '

That will give you the object name (SHA1sum) of the blob, and then you can use a script like this one:

... to find the commit that points to each of those blobs.

Step 1: Write all file SHA1s to a text file:

git rev-list --objects --all | sort -k 2 > allfileshas.txt

Step 2: Sort the blobs from biggest to smallest and write the results to a text file:

git gc && git verify-pack -v .git/objects/pack/pack-*.idx | egrep "^\w+ blob\W+[0-9]+ [0-9]+ [0-9]+$" | sort -k 3 -n -r > bigobjects.txt

Step 3a: Combine both text files to get file name/sha1/size information:

for SHA in `cut -f 1 -d\  < bigobjects.txt`; do
    echo $(grep $SHA bigobjects.txt) $(grep $SHA allfileshas.txt) | awk '{print $1,$3,$7}' >> bigtosmall.txt
done

Step 3b: If you have file names or path names containing spaces, try this variation of Step 3a. It uses cut instead of awk to get the desired columns, including spaces, from column 7 to the end of the line:

for SHA in `cut -f 1 -d\  < bigobjects.txt`; do
    echo $(grep $SHA bigobjects.txt) $(grep $SHA allfileshas.txt) | cut -d ' ' -f'1,3,7-' >> bigtosmall.txt
done

Now you can look at the file bigtosmall.txt in order to decide which files you want to remove from your Git history.

Step 4: To perform the removal (note this part is slow, since it's going to examine every commit in your history for data about the file you identified):

git filter-branch --tree-filter 'rm -f myLargeFile.log' HEAD

Source

Steps 1-3a were copied from Finding and Purging Big Files From Git History.

EDIT

The article was deleted sometime in the second half of 2017, but an archived copy of it can still be accessed using the Wayback Machine.

You should use BFG Repo-Cleaner.

According to the website:

The BFG is a simpler, faster alternative to git-filter-branch for cleansing bad data out of your Git repository history:

  • Removing Crazy Big Files
  • Removing Passwords, Credentials & other Private data

The classic procedure for reducing the size of a repository would be:

git clone --mirror git://example.com/some-big-repo.git
java -jar bfg.jar --strip-biggest-blobs 500 some-big-repo.git
cd some-big-repo.git
git reflog expire --expire=now --all
git gc --prune=now --aggressive
git push

If you only want to have a list of large files, then I'd like to provide you with the following one-liner:

join -o "1.1 1.2 2.3" <(git rev-list --objects --all | sort) <(git verify-pack -v objects/pack/*.idx | sort -k3 -n | tail -5 | sort) | sort -k3 -n

Its output will be:

blob hash    file name                                  size in bytes

72e1e6d20... db/players.sql 818314
ea20b964a... app/assets/images/background_final2.png 6739212
f8344b9b5... data_test/pg_xlog/000000010000000000000001 1625545
1ecc2395c... data_development/pg_xlog/000000010000000000000001 16777216
bc83d216d... app/assets/images/background_1forfinal.psd 95533848

The last entry in the list points to the largest file in your git history.

You can use this output to make sure that you're not deleting stuff with BFG that you would have needed in your history.

Be aware that you need to clone your repository with --mirror for this to work.

If you are on Windows, here is a PowerShell script that will print the 10 largest files in your repository:

$revision_objects = git rev-list --objects --all;
$files = $revision_objects.Split() | Where-Object {$_.Length -gt 0 -and $(Test-Path -Path $_ -PathType Leaf) };
$files | Get-Item -Force | select fullname, length | sort -Descending -Property Length | select -First 10

PowerShell solution for Windows git, to find the largest files:

git ls-tree -r -t -l --full-name HEAD | Where-Object {
 $_ -match '(.+)\s+(.+)\s+(.+)\s+(\d+)\s+(.*)'
 } | ForEach-Object {
 New-Object -Type PSObject -Property @{
     'Mode' = $matches[1]
     'Type' = $matches[2]
     'Hash' = $matches[3]
     'Size' = [int]$matches[4]
     'Path' = $matches[5]
 }
 } | sort -Property Size -Top 10 -Descending

Try git ls-files | xargs du -hs --threshold=1M .

We use the below command in our CI pipeline; it halts if it finds any big files in the git repo:

test $(git ls-files | xargs du -hs --threshold=1M 2>/dev/null | tee /dev/stderr | wc -l) -gt 0 && { echo; echo "Aborting due to big files in the git repository."; exit 1; } || true

I was unable to make use of the most popular answer because the --batch-check command-line switch of Git 1.8.3 (which I have to use) does not accept any arguments. The ensuing steps have been tried on CentOS 6.5 with Bash 4.1.2.

Key Concepts

In Git, the term blob implies the contents of a file. Note that a commit might change the contents of a file or pathname. Thus, the same file could refer to a different blob depending on the commit. A certain file could be the biggest in the directory hierarchy in one commit, while not in another. Therefore, the question of finding large commits instead of large files puts matters in the correct perspective.
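This can be seen directly: the same path can resolve to different blob hashes in different commits (the path below is illustrative; substitute one from your own repo):

```shell
# Print the blob hash that a given path resolves to in two different
# commits. If the file's contents changed between them, the hashes differ.
git rev-parse HEAD:README.md
git rev-parse HEAD~1:README.md
```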

For The Impatient

The command to print the list of blobs in descending order of size is:

git cat-file --batch-check < <(git rev-list --all --objects  | \
awk '{print $1}')  | grep blob  | sort -n -r -k 3

Sample output:

3a51a45e12d4aedcad53d3a0d4cf42079c62958e blob 305971200
7c357f2c2a7b33f939f9b7125b155adbd7890be2 blob 289163620

To remove such blobs, use the BFG Repo Cleaner, as mentioned in other answers. Given a file blobs.txt that just contains the blob hashes, for example:

3a51a45e12d4aedcad53d3a0d4cf42079c62958e
7c357f2c2a7b33f939f9b7125b155adbd7890be2

Do:

java -jar bfg.jar -bi blobs.txt <repo_dir>

The question is about finding the commits, which is more work than finding blobs. To find out how, please read on.
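(As an aside: on Git 2.16 or newer — not the Git 1.8.3 this answer targets — git log can do the blob-to-commit lookup directly:)

```shell
# List commits whose trees contain the given blob (Git >= 2.16).
# Replace the hash with a blob hash from your own repository.
git log --all --oneline --find-object=3a51a45e12d4aedcad53d3a0d4cf42079c62958e
```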

Further Work

Given a commit hash, a command that prints the hashes of all objects associated with it, including blobs, is:

git ls-tree -r --full-tree <commit_hash>

So, if we have such outputs available for all commits in the repo, then, given a blob hash, the matching commits are the ones whose outputs contain it. This idea is encoded in the following script:

#!/bin/bash
DB_DIR='trees-db'

find_commit() {
    cd ${DB_DIR}
    for f in *; do
        if grep -q $1 ${f}; then
            echo ${f}
        fi
    done
    cd - > /dev/null
}

create_db() {
    local tfile='/tmp/commits.txt'
    mkdir -p ${DB_DIR} && cd ${DB_DIR}
    git rev-list --all > ${tfile}

    while read commit_hash; do
        if [[ ! -e ${commit_hash} ]]; then
            git ls-tree -r --full-tree ${commit_hash} > ${commit_hash}
        fi
    done < ${tfile}
    cd - > /dev/null
    rm -f ${tfile}
}

create_db

while read id; do
    find_commit ${id};
done

If the contents are saved in a file named find-commits.sh, then a typical invocation will be:

cat blobs.txt | bash find-commits.sh

As earlier, the file blobs.txt lists blob hashes, one per line. The create_db() function saves a cache of all commit listings in a sub-directory of the current directory.

Some stats from my experiments on a system with two Intel(R) Xeon(R) CPU E5-2620 2.00GHz processors, presented by the OS as 24 virtual cores:

  • Total number of commits in the repo = almost 11,000
  • File creation speed = 126 files/s. The script creates a single file per commit. This occurs only when the cache is being created for the first time.
  • Cache creation overhead = 87 s.
  • Average search speed = 522 commits/s. The cache optimization resulted in an 80% reduction in running time.

Note that the script is single-threaded. Therefore, only one core would be used at any one time.

For Windows, I wrote a PowerShell version of this answer:

function Get-BiggestBlobs {
  param ([Parameter(Mandatory)][String]$RepoFolder, [int]$Count = 10)
  Write-Host ("{0} biggest files:" -f $Count)
  git -C $RepoFolder rev-list --objects --all | git -C $RepoFolder cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | ForEach-Object {
    $Element = $_.Trim() -Split '\s+'
    $ItemType = $Element[0]
    if ($ItemType -eq 'blob') {
      New-Object -TypeName PSCustomObject -Property @{
          ObjectName = $Element[1]
          Size = [int]([int]$Element[2] / 1kB)
          Path = $Element[3]
      }
    }
  } | Sort-Object Size | Select-Object -last $Count | Format-Table ObjectName, @{L='Size [kB]';E={$_.Size}}, Path -AutoSize
}

You'll probably want to fine-tune whether it displays kB or MB or just bytes, depending on your own situation.

There's probably potential for performance optimization, so feel free to experiment if that's a concern for you.

To get all changes, just omit | Select-Object -last $Count .
To get a more machine-readable version, just omit | Format-Table @{L='Size [kB]';E={$_.Size}}, Path -AutoSize .

Use the --analyze feature of git-filter-repo like this:

$ cd my-repo-folder
$ git-filter-repo --analyze
$ less .git/filter-repo/analysis/path-all-sizes.txt

How can I track down the large files in the git history?

Start by analyzing, validating and selecting the root cause. Use git-repo-analysis to help.

You may also find some value in the detailed reports generated by BFG Repo-Cleaner, which can be run very quickly by cloning to a Digital Ocean droplet, using their 10MiB/s network throughput.

I stumbled across this for the same reason as anyone else, but the quoted scripts didn't quite work for me. I've made one that is more a hybrid of those I've seen, and it now lives here: https://gitlab.com/inorton/git-size-calc
