Reduce size of git repository on Bitbucket
After a few months of commits and pushes for my project, the size of the repository on Bitbucket has grown gradually. It is now about 1 GB. I tried to remove some database folders that did not need to be in the repository. After searching, I found that most suggestions propose:
git filter-branch -f --tree-filter 'rm -rf folder/subfolder' HEAD
After removing a few folders, I pushed the change to the repository with --force, as in:
git push origin master --force
I finally found that the repository gets larger every time I use those commands! It has visibly grown to 2.5 GB!
Any suggestions, please?
EDIT: Following the suggestion below, I tried the following commands
(for all large files)
git filter-branch --index-filter "git rm -rf --cached --ignore-unmatch $files" --tag-name-filter cat -- --all
(remove the temporary history that git-filter-branch otherwise leaves behind for a long time)
rm -rf .git/refs/original/
git reflog expire --all
git gc --aggressive --prune
But the folder .git/objects is still very large!
OK, given your answer to the comment, we can now say what happened.
What git filter-branch does is to copy (some or all of) your commits to new ones, then update the references. This means your repository gets bigger (not smaller), at least initially.
The commits that are copied are those reachable via the references given. In this case, the reference you gave is HEAD (which git turns into "your current branch": probably master, but in any case whatever your current branch was at the time of the filter-branch command). If (and only if) the new copy is precisely, bit-for-bit identical to the original, then it actually is the original and there is no actual copy made (the original is reused instead). However, as soon as you make any change, such as removing folder/subfolder, from that point on these really are copies.
The copied stuff is, in this case, smaller, because you've removed some items. (It's generally not very much smaller, since git compresses items pretty well.) But you're still adding more stuff to the repository: new commits, which refer to new trees, which (fortunately) refer to the same old blobs (file objects) as before, just slightly fewer of them this time. The objects for the folder/subfolder files are still in the repository, but the copied commits and tree objects no longer refer to them.
Pictorially, at this point in the filter-branch process, we now have both the old commits:
R--o--o---o--o   <-- master
    \    /
     o--o   <-- feature
and the new ones (I'll assume folder/subfolder appeared in the original root commit R, so that we have a copy R' here):
R'-o'-o'--o'-o'
    \    /
     o'-o'
What filter-branch does now, at the end of the copying process, is re-point some references (branch and tag names, mainly). The ones it re-points are the ones you tell it to, by mentioning them as what the documentation calls "positive references". In this case, if you were on master (i.e., HEAD was another name for master), the single positive reference you gave is master, so that's all filter-branch re-points. It also makes backup references whose names start with refs/original/. This means you now have the following commits:
R--o--o---o--o   <-- refs/original/refs/heads/master
    \    /
     o--o   <-- feature

R'-o'-o'--o'-o'   <-- master
    \    /
     o'-o'
Note that feature still points to the old (not-copied) commits, so that even after you get rid of any refs/original/ references, git will retain all the still-referenced commits across any garbage-collect activity, giving:
R--o
    \
     o--o   <-- feature

R'-o'-o'--o'-o'   <-- master
    \    /
     o'-o'
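The situation pictured above can be reproduced on a throwaway repository. The following is only a sketch: the file names, the demo identity, and the temporary directory are made up for illustration, and FILTER_BRANCH_SQUELCH_WARNING is set just to skip the warning pause that newer git versions add to filter-branch.

```shell
#!/bin/sh
# Sketch: filter only HEAD (master) and see which references move.
# All paths and names below are hypothetical.
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master     # make sure the branch is called master

mkdir -p folder/subfolder
echo big > folder/subfolder/data.db
echo code > main.txt
git add .
git commit -q -m "root commit R"

git branch feature                          # feature shares R with master
echo more >> main.txt
git commit -q -am "second commit on master"

feature_before=$(git rev-parse feature)
master_before=$(git rev-parse master)

# Same filter as in the question, applied to HEAD only.
git filter-branch -f --tree-filter 'rm -rf folder/subfolder' HEAD >/dev/null 2>&1

feature_after=$(git rev-parse feature)
master_after=$(git rev-parse master)
backup=$(git rev-parse refs/original/refs/heads/master)

# Lists master (new commits), feature (old commits), and the backup reference.
git for-each-ref
```

Here master moves to the copied commits, the backup under refs/original/ keeps the old master, and feature is left pointing at the original history.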
To get filter-branch to update all the references, you need to name them all. An easy way to do that is to use --all, which quite literally names all references. In this case, the initial "after" picture looks like this instead:
R--o--o---o--o   <-- refs/original/refs/heads/master
    \    /
     o--o   <-- refs/original/refs/heads/feature

R'-o'-o'--o'-o'   <-- master
    \    /
     o'-o'   <-- feature
Now if you erase all the refs/original/ references, all the old commits become unreferenced and can get garbage-collected. Well, that is, they do unless there are tags pointing to them.
For tag references, filter-branch only updates them if you supply a --tag-name-filter. Usually you want --tag-name-filter cat, which keeps the tag names unchanged but makes filter-branch point them to the newly copied commits. That way you don't hang on to the old commits: the whole point of the exercise is to make everything use the new copies and throw away the old ones, so that the big-file objects can be garbage-collected.
Putting this all together, instead of:
git filter-branch -f --tree-filter 'rm -rf folder/subfolder'
you can use:
git filter-branch -f --tree-filter 'rm -rf folder/subfolder' \
--tag-name-filter cat -- --all
(You don't need the backslash-newline sequence; I put that in just to make the line fit better on Stack Overflow. Note that --tree-filter is very slow: for this particular case it is much faster to use --index-filter. The index filter command here would be git rm --cached --ignore-unmatch -r folder/subfolder.)
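Spelled out in full, the --index-filter variant might look like the sketch below, run here against a throwaway repository so it can be tried safely. The repository contents, the tag name v1, and the demo identity are hypothetical; only the filter-branch options come from the text above (plus -q to keep git rm quiet).

```shell
#!/bin/sh
# Sketch: remove folder/subfolder from every branch and tag with --index-filter.
# All paths and names below are hypothetical.
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1      # skip the warning pause in newer git
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master

mkdir -p folder/subfolder
echo big > folder/subfolder/data.db
echo code > main.txt
git add .
git commit -q -m "root"
git tag v1                                  # a tag on the commit that has the folder
git branch feature
echo more >> main.txt
git commit -q -am "second"

v1_before=$(git rev-parse v1)

git filter-branch -f \
    --index-filter 'git rm -r -q --cached --ignore-unmatch folder/subfolder' \
    --tag-name-filter cat -- --all >/dev/null 2>&1

v1_after=$(git rev-parse v1)

# Count objects under folder/subfolder still reachable from branches and tags
# (the refs/original/ backups are deliberately not included here).
leftover=$(git rev-list --branches --tags --objects | grep -c 'folder/subfolder' || true)
echo "paths still referenced from branches and tags: $leftover"
```

The index filter never checks files out into a work tree, which is why it tends to be much faster than --tree-filter on large histories.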
Note also that you need to do all this on (a copy of) the original repository (you did keep a backup, right?). (If you did not keep a backup, the refs/original/ references may be your salvation.)
Edit: OK, so you did some filter-branch-ing, and you did something that deleted any refs/original/ references. (In my experiment on a temp repo, running git filter-branch on HEAD used whatever branch I was on as the branch that was re-pointed, and made a refs/original/ copy of the previous value.) There are no backups of the repository. Now what?
Well, as a first step, make a backup now. This way, if things get any worse, you can at least get back to "only slightly bad". To make a backup of the repo, you can simply clone it (or: clone it, then call the original the "backup", then begin working on the clone). For future reference, since git filter-branch can be quite destructive, it's usually wise to start with this backing-up process. (Also, I'll note that a clone on Bitbucket, as long as it had not yet been pushed to, would serve. Unfortunately you did a push. Perhaps Bitbucket can retrieve an earlier version of the repository from some backups or snapshots of their own.)
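One way to take such a backup is a mirror clone, which copies every reference rather than just the branches. This is a sketch, assuming a local repository in a directory called project; both project and project-backup.git are placeholder names, and the small repository is created here only so the example runs on its own.

```shell
#!/bin/sh
# Sketch: back up a repository before rewriting its history.
# "project" and "project-backup.git" are hypothetical paths.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"

# Stand-in for the real repository.
git init -q project
cd project
git symbolic-ref HEAD refs/heads/master
echo code > main.txt
git add main.txt
git commit -q -m "initial commit"
cd ..

# --mirror copies all refs; --no-hardlinks makes the object files independent copies.
git clone -q --mirror --no-hardlinks project project-backup.git

orig_head=$(git -C project rev-parse master)
backup_head=$(git -C project-backup.git rev-parse master)
echo "original: $orig_head"
echo "backup:   $backup_head"
```

If a later rewrite goes wrong, the backup can simply be cloned again to get the untouched history back.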
Next, let's note a peculiarity of commits and their SHA-1 "true names", which I mentioned earlier. The SHA-1 name of a commit is a cryptographic checksum of its contents. Let's take a look at a sample commit in git's own source tree (trimmed down a bit just for length, and with email addresses mangled to foil harvesters):
$ git cat-file -p 5de7f500c13c8158696a68d86da1030313ddaf69
tree 73eee5d136d2b00c623c3fceceffab85c9e9b47e
parent c4ad00f8ccb59a0ae0735e8e32b203d4bd835616
author Jeff King <peff peff.net> 1405233728 -0400
committer Junio C Hamano <gitster pobox.com> 1406567673 -0700

alloc: factor out commit index

We keep a static counter to set the commit index on newly
allocated objects. However, since we also need to set the
[snip]
Here, we can see that the contents of this commit (whose "true name" is 5de7f50...) start with a tree and another SHA-1, a parent and another SHA-1, an author and a committer, then a blank line followed by the commit message text.
If you look at a tree you'll see that it contains the "true names" (SHA-1 values) of sub-trees (sub-directories) and file objects ("blobs", in git terminology), along with their modes (really, just whether the blob should have execute permission set or not) and their names within the directory. For instance, the first line of the above tree is:
100644 blob 5e98806c6cc246acef5f539ae191710a0c06ad3f .gitattributes
which means that the repository object 5e98806... should be extracted, put in a file named .gitattributes, and set non-executable.
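The same inspection can be tried on any small repository. The sketch below builds a one-file repository (the README file and the demo identity are made up for illustration) and prints the commit and its tree with git cat-file and git ls-tree.

```shell
#!/bin/sh
# Sketch: look inside a commit object and the tree it points to.
# The repository and file name are hypothetical.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q .
echo "hello" > README
git add README
git commit -q -m "add README"

git cat-file -p HEAD        # tree, author, committer, blank line, message
git ls-tree HEAD            # one line per entry: mode, type, ID, name

# Split the single ls-tree line into its four fields.
set -- $(git ls-tree HEAD)
mode=$1
type=$2
blob_id=$3
name=$4
echo "mode=$mode type=$type name=$name"
```

The blob ID shown by git ls-tree is the same value git hash-object computes from the file contents, which is what makes these IDs "true names".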
If I ask git to make a new commit, and set up, as its contents, the same tree (73eee5d...), the same parent (c4ad00f...), and the same author, committer, and message, then when I get git to write that commit to the repository, it will generate the same "true name" 5de7f50....
In other words, it literally is the same commit: it's already in the repository, and git commit-tree will just give me back the existing ID. While it's a bit tricky to set all this up, that's exactly what git filter-branch ends up doing: it extracts the original commit, applies your filters, sets up everything, and then does a git commit-tree.
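This can be checked directly with git commit-tree. The following is a sketch on a throwaway repository: the demo identity and file name are hypothetical, and the author and committer dates are pinned through environment variables so that the re-created commit gets exactly the same author and committer lines as the original.

```shell
#!/bin/sh
# Sketch: re-creating a commit with identical contents yields the identical ID.
# Identity, dates, and file names are hypothetical.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
# Pin the timestamps so every commit made below uses the same ones.
export GIT_AUTHOR_DATE="1405233728 -0400"
export GIT_COMMITTER_DATE="1405233728 -0400"

repo=$(mktemp -d)
cd "$repo"
git init -q .
echo one > file.txt
git add file.txt
git commit -q -m "first"
echo two >> file.txt
git commit -q -am "second"

orig=$(git rev-parse HEAD)
tree=$(git rev-parse 'HEAD^{tree}')
parent=$(git rev-parse 'HEAD^')

# Feed the original message (everything after the first blank line of the
# raw commit object) back into commit-tree with the same tree and parent.
copy=$(git cat-file commit HEAD | sed '1,/^$/d' | git commit-tree "$tree" -p "$parent")

# Changing any part of the contents, here the message, gives a different ID.
changed=$(echo "different message" | git commit-tree "$tree" -p "$parent")

echo "original: $orig"
echo "copy:     $copy"
echo "changed:  $changed"
```

The copy is not a new object at all: git computes the same ID and finds the commit already in the repository.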
On your original repo, you ran a git filter-branch command that copied commits to new, modified commits (with different trees and hence, at some point, different true names, which led to different parent IDs in subsequent commits, and so on). However, if you copy those copied commits by applying a filter that this time does nothing, then the new tree objects will be the same as the old ones. If the new parent is the same, and the author, committer, and message also all remain the same, the new commit ID for the copy will be the same as the old ID.
That is, these new copies are not copies after all, they're just the originals again!
Any other commits (those that were not copied in the first pass) do get copied, and hence have different IDs.
Here's where things get tricky.
If your current repository looks like this (graphically speaking):
R--o--o---o--o   <-- xxx  [needs a name so that filter-branch will process it]
    \    /
     o--o   <-- feature

R'-o'-o'--o'-o'   <-- master
    \    /
     o'-o'
and we apply a new filter-branch to all references (or even "all but master") in such a way that it generates the same trees this time, it will copy R again and the new tree will match that for R', so the copy will actually be R'. Then it will copy the first post-R node, make the same changes, and the copy will actually be the first post-R' node, o'. This will repeat for all nodes, possibly even including R' and all the o' nodes. If filter-branch copies R', the resulting copy will just be R' again, though, because "remove nonexistent directory" makes no change: our filter does nothing to these particular commits.
Finally, filter-branch will move the labels, leaving the refs/original/ versions behind:
R--o--o---o--o   <-- refs/original/refs/heads/xxx
    \    /
     o--o   <-- refs/original/refs/heads/feature

R'-o'-o'--o'-o'   <-- master, xxx
    \    /
     o'-o'   <-- feature
This is, in fact, the desired outcome.
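The two-pass behavior described above can be tried out as follows. This is a sketch on a throwaway repository with made-up file names and identity: the first pass filters only HEAD, the second pass repeats the same filter over all references, and the script then compares the results.

```shell
#!/bin/sh
# Sketch: re-running the same filter over all refs reuses the already-filtered commits.
# All paths and names below are hypothetical.
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1      # skip the warning pause in newer git
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master

mkdir -p folder/subfolder
echo big > folder/subfolder/data.db
echo code > main.txt
git add .
git commit -q -m "root commit R"
git branch feature
echo more >> main.txt
git commit -q -am "second commit on master"

filter='git rm -r -q --cached --ignore-unmatch folder/subfolder'

# First pass: only HEAD (master) is rewritten; feature keeps the old R.
git filter-branch -f --index-filter "$filter" HEAD >/dev/null 2>&1
master_first=$(git rev-parse master)

# Second pass: the same filter, this time over every reference.
git filter-branch -f --index-filter "$filter" \
    --tag-name-filter cat -- --all >/dev/null 2>&1
master_second=$(git rev-parse master)

# Both branches should now start from the same filtered root commit R'.
root_of_master=$(git rev-list --max-parents=0 master)
root_of_feature=$(git rev-list --max-parents=0 feature)

echo "master after pass 1: $master_first"
echo "master after pass 2: $master_second"
echo "root of master:      $root_of_master"
echo "root of feature:     $root_of_feature"
```

Because the filter has nothing left to remove on master, the second pass produces the same commit IDs there, while feature is moved onto the filtered history.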
What if the repository looks more like this? That is, what if there is no xxx or similar label pointing to the original (pre-filtering) master, so that you have this:
R--o
    \
     o--o   <-- feature

R'-o'-o'--o'-o'   <-- master
    \    /
     o'-o'
The filter-branch script will still copy R and the result will still be R'. Then it will copy the first o node and the result will still be the first o' node, and so on. It won't copy the now-deleted nodes, but it won't have to: we already have those, reachable via the branch name master. As before, filter-branch may copy R' and the various o' nodes, but this is OK, as the filter will do nothing, so the copies are really just the originals after all.
Last, filter-branch will, as usual, update the references:
R--o
    \
     o--o   <-- refs/original/refs/heads/feature

R'-o'-o'--o'-o'   <-- master
    \    /
     o'-o'   <-- feature
The key that makes this all work is that the filter leaves already-modified commits untouched, so that their second "copies" are just the first copies again. [1]
Once everything is done, you can do the same shrinking described in the git filter-branch documentation to ditch the refs/original/ names and garbage-collect the now-unreferenced objects.
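The shrinking steps might be run as in the sketch below, which follows the checklist in the git filter-branch documentation. The throwaway repository, file names, and identity are hypothetical; the script records the ID of the unwanted blob first so it can check afterwards that the object is really gone.

```shell
#!/bin/sh
# Sketch: filter all refs, drop the backups, and garbage-collect the old objects.
# All paths and names below are hypothetical.
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1      # skip the warning pause in newer git
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master

mkdir -p folder/subfolder
echo big > folder/subfolder/data.db
echo code > main.txt
git add .
git commit -q -m "root"
echo more >> main.txt
git commit -q -am "second"

# Remember the object we want to get rid of.
blob=$(git rev-parse HEAD:folder/subfolder/data.db)

git filter-branch -f \
    --index-filter 'git rm -r -q --cached --ignore-unmatch folder/subfolder' \
    --tag-name-filter cat -- --all >/dev/null 2>&1

# 1. Delete the backup references left under refs/original/.
git for-each-ref --format='%(refname)' refs/original/ | xargs -n 1 git update-ref -d
# 2. Expire all reflog entries so they no longer keep old commits alive.
git reflog expire --expire=now --all
# 3. Garbage-collect, pruning unreachable objects immediately.
git gc -q --prune=now

if git cat-file -e "$blob" 2>/dev/null; then
    blob_present=yes
else
    blob_present=no
fi
echo "old blob still in repository: $blob_present"
```

Note the explicit --expire=now and --prune=now: without them, git keeps recently unreferenced objects around for a grace period, so .git/objects does not shrink right away.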
[1] If you had been using a filter that is not as easily repeated (e.g., one that makes new commits with "the current time" as their time-stamps), you would really need an untouched original repository, or those refs/original/ references (either one would suffice to keep an "original copy" around).