Reduce size of git repository on Bitbucket

After a few months of commits and pushes for my project, the size of the repository on Bitbucket has increased gradually to about 1 GB. I tried to remove some database folders that did not need to be tracked. After searching, I found that most suggestions propose:

git filter-branch -f --tree-filter 'rm -rf folder/subfolder' HEAD

After removing a few folders, I pushed the change to the repository with --force:

git push origin master --force

I finally found that the repository gets larger every time I use those commands! The repository has now visibly grown to 2.5 GB!

Any suggestions, please?

EDIT: Following the suggestion below, I tried the following commands
(for all large files)

git filter-branch --index-filter "git rm -rf --cached --ignore-unmatch $files" --tag-name-filter cat -- --all

(remove the temporary history that git filter-branch otherwise leaves behind for a long time)

rm -rf .git/refs/original/

git reflog expire --all
git gc --aggressive --prune

But the .git/objects folder is still very large!

OK, given your answer in the comments, we can now say what happened.

What git filter-branch does is copy (some or all of) your commits to new ones, then update the references. This means your repository gets bigger (not smaller), at least initially.

The commits that are copied are those reachable via the references given. In this case, the reference you gave is HEAD (which git turns into "your current branch": probably master, but in any case whatever your current branch was at the time of the filter-branch command). If (and only if) the new copy is precisely, bit-for-bit identical to the original, then it actually is the original and no actual copy is made (the original is reused instead). However, as soon as you make any change (such as removing folder/subfolder), from that point on these really are copies.

The copied stuff is, in this case, smaller, because you've removed some items. (It's generally not very much smaller, since git compresses items pretty well.) But you're still adding more stuff to the repository: new commits, which refer to new trees, which (fortunately) refer to the same old blobs (file objects) as before, just slightly fewer of them this time. The objects for the folder/subfolder files are still in the repository, but the copied commits and tree objects no longer refer to them.

Pictorially, at this point in the filter-branch process, we now have both the old commits:

R--o--o---o--o   <-- master
    \    /
     o--o        <-- feature

and the new ones (I'll assume folder/subfolder appeared in the original root commit R, so that we have a copy R' here):

R'-o'-o'--o'-o'
    \    /
     o'-o'

What filter-branch does now, at the end of the copying process, is re-point some references (branch and tag names, mainly). The ones it re-points are the ones you tell it to, by mentioning them as what the documentation calls "positive references". In this case, if you were on master (i.e., HEAD was another name for master), the single positive reference you gave is master, so that is all that filter-branch re-points. It also makes backup references whose names start with refs/original/. This means you now have the following commits:

R--o--o---o--o   <-- refs/original/refs/heads/master
    \    /
     o--o        <-- feature

R'-o'-o'--o'-o'  <-- master
    \    /
     o'-o'

Note that feature still points to the old (not-copied) commits, so that even after you get rid of any refs/original/ references, git will retain all the still-referenced commits across any garbage-collection activity, giving:

R--o
    \
     o--o        <-- feature

R'-o'-o'--o'-o'  <-- master
    \    /
     o'-o'
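The picture above can be reproduced in a throwaway repository. This is only a sketch: the scratch directory, the file names, and the `Example` identity are made up for illustration, and `FILTER_BRANCH_SQUELCH_WARNING` is set on the assumption that your git is recent enough to print a deprecation notice for filter-branch.

```shell
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1     # skip the deprecation pause in newer git
repo=$(mktemp -d)                          # hypothetical scratch repository
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master    # use "master" whatever init.defaultBranch says
git config user.name  "Example"
git config user.email "example@example.invalid"

mkdir -p folder/subfolder
echo big > folder/subfolder/data
echo hi > README
git add .
git commit -q -m "root commit R"
git branch feature                         # feature stays at R
echo more > README
git commit -q -am "second commit on master"

old_master=$(git rev-parse master)
old_feature=$(git rev-parse feature)

# Filter only HEAD (that is, master), as in the question.
git filter-branch -f --tree-filter 'rm -rf folder/subfolder' HEAD >/dev/null 2>&1

new_master=$(git rev-parse master)
new_feature=$(git rev-parse feature)
backup=$(git rev-parse refs/original/refs/heads/master)

# List every ref: master is re-pointed, feature is not, and a backup ref exists.
git for-each-ref --format='%(refname) %(objectname:short)'
```

Here feature keeps its old commit ID, so the folder/subfolder blob stays reachable no matter what is done to master.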

To get filter-branch to update all the references, you need to name them all. An easy way to do that is to use --all, which quite literally names all references. In this case, the initial "after" picture looks like this instead:

R--o--o---o--o   <-- refs/original/refs/heads/master
    \    /
     o--o        <-- refs/original/refs/heads/feature

R'-o'-o'--o'-o'  <-- master
    \    /
     o'-o'       <-- feature

Now if you erase all the refs/original/ references, all the old commits become unreferenced and can get garbage-collected. Well, that is, they do unless there are tags pointing to them.

For tag references, filter-branch only updates them if you supply a --tag-name-filter. Usually you want --tag-name-filter cat, which keeps the tag names unchanged but makes filter-branch point them to the newly copied commits. That way you don't hang on to the old commits: the whole point of the exercise is to make everything use the new copies, and throw away the old copies, so that the big-file objects can be garbage-collected.


Putting this all together, instead of:

git filter-branch -f --tree-filter 'rm -rf folder/subfolder'

you can use:

git filter-branch -f --tree-filter 'rm -rf folder/subfolder' \
    --tag-name-filter cat -- --all

(You don't need the backslash-newline sequence; I put that in just to make the line fit better on stackoverflow. Note that --tree-filter is very slow: for this particular case it is much faster to use --index-filter. The index filter command here would be git rm --cached --ignore-unmatch -r folder/subfolder.)
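As a rough sketch of the --index-filter variant applied to every branch and tag, here is a run against a scratch repository. The repository contents, the tag name `v1`, and the `Example` identity are hypothetical; only the filter-branch options come from the text above.

```shell
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1     # skip the deprecation pause in newer git
repo=$(mktemp -d); cd "$repo"              # hypothetical scratch repository
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"; git config user.email "example@example.invalid"

mkdir -p folder/subfolder
echo big > folder/subfolder/data
echo hi > README
git add .
git commit -q -m "root commit"
git tag v1                                 # a tag that must follow the rewrite
git branch feature

# Rewrite all refs; "cat" keeps tag names but moves them to the new commits.
git filter-branch -f \
    --index-filter 'git rm -r -q --cached --ignore-unmatch folder/subfolder' \
    --tag-name-filter cat -- --all >/dev/null 2>&1

# Count paths under folder/subfolder that are still visible from any ref.
remaining=$(for ref in master feature v1; do
    git ls-tree -r --name-only "$ref"
done | grep -c '^folder/subfolder/' || true)
echo "paths still under folder/subfolder: $remaining"
```

The old commits are still held by refs/original/ at this point, so the repository has not shrunk yet; it has only stopped referring to the folder from its branches and tags.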

Note also that you need to do all this on (a copy of) the original repository (you did keep a backup, right?). (If you did not keep a backup, the refs/original/ references may be your salvation.)


Edit: OK, so you did some filter-branch-ing, and you did something that deleted any refs/original/ references. (In my experiment on a temp repo, running git filter-branch on HEAD used whatever branch I was on as the branch that was re-pointed, and made a refs/original/ copy of the previous value.) There are no backups of the repository. Now what?

Well, as a first step, make a backup now. This way, if things get any worse, you can at least get back to "only slightly bad". To make a backup of the repo, you can simply clone it (or: clone it, then call the original the "backup", then begin working on the clone). For future reference, since git filter-branch can be quite destructive, it's usually wise to start by doing this backing-up process. (Also, I'll note that a clone on Bitbucket that had not yet been pushed to would have served. Unfortunately you did a push. Perhaps Bitbucket can retrieve an earlier version of the repository from some backups or snapshots of their own.)
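One way to take such a backup is sketched below. It uses `git clone --mirror` rather than a plain clone, which is my assumption about what is wanted here: a mirror copies every reference (branches, tags, and any refs/original/ names), whereas a plain clone only maps branches to remote-tracking names. The paths and repository contents are hypothetical.

```shell
set -e
src=$(mktemp -d)                     # hypothetical stand-in for your working repository
backup=$(mktemp -d)/backup.git       # hypothetical backup location
cd "$src"
git init -q
git config user.name "Example"; git config user.email "example@example.invalid"
echo hi > README; git add README; git commit -q -m "initial"
git branch feature
git tag v1

# Copy all refs and the objects they reach into a bare backup repository.
git clone -q --mirror "$src" "$backup"

# Compare the full ref lists of the source and the backup.
src_refs=$(git -C "$src" for-each-ref --format='%(refname) %(objectname)')
backup_refs=$(git -C "$backup" for-each-ref --format='%(refname) %(objectname)')
```

A clone only carries objects reachable from some reference, so if you also want reflogs and unreferenced objects preserved, copying the whole directory is the safer choice.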

Next, let's note a peculiarity of commits and their SHA-1 "true names", which I mentioned earlier. The SHA-1 name of a commit is a cryptographic checksum of its contents. Let's take a look at a sample commit in git's own source tree (trimmed down a bit just for length, and email addresses whacked to foil harvesters):

$ git cat-file -p 5de7f500c13c8158696a68d86da1030313ddaf69
tree 73eee5d136d2b00c623c3fceceffab85c9e9b47e
parent c4ad00f8ccb59a0ae0735e8e32b203d4bd835616
author Jeff King <peff peff.net> 1405233728 -0400
committer Junio C Hamano <gitster pobox.com> 1406567673 -0700

alloc: factor out commit index

We keep a static counter to set the commit index on newly
allocated objects. However, since we also need to set the
[snip]

Here, we can see that the contents of this commit (whose "true name" is 5de7f50...) start with a tree and another SHA-1, a parent and another SHA-1, an author and committer, then a blank line followed by the commit message text.

If you look at a tree you'll see that it contains the "true names" (SHA-1 values) of sub-trees (sub-directories) and file objects ("blobs", in git terminology), along with their modes (really, just whether the blob should have execute permission set or not) and their names within the directory. For instance, the first line of the above tree is:

100644 blob 5e98806c6cc246acef5f539ae191710a0c06ad3f    .gitattributes

which means that the repository object 5e98806... should be extracted, put in a file named .gitattributes, and set non-executable.

If I ask git to make a new commit, and set up, as its contents:

  • the same tree (73eee5d...)
  • the same parent (c4ad00f...)
  • the same author and committer
  • and the same blank line and message

then when I get git to write that commit to the repository, it will generate the same "true name" 5de7f50.... In other words, it literally is the same commit: it's already in the repository and git commit-tree will just give me back the existing ID. While it's a bit tricky to set all this up, that's exactly what git filter-branch ends up doing: it extracts the original commit, applies your filters, sets up everything, and then does a git commit-tree.
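The "same contents, same ID" behaviour can be checked by hand with git commit-tree. The following is a minimal sketch in a scratch repository (the repository, file, and `Example` identity are made up); it re-creates a root commit from its own tree, identity, timestamps, and message, and compares the resulting ID with the original.

```shell
set -e
repo=$(mktemp -d); cd "$repo"          # hypothetical scratch repository
git init -q
git config user.name "Example"; git config user.email "example@example.invalid"
git config commit.gpgsign false        # keep the example independent of any global signing setup
echo hi > README
git add README
git commit -q -m "only commit"

orig=$(git rev-parse HEAD)
tree=$(git rev-parse 'HEAD^{tree}')

# Reuse the original author/committer identity and timestamps exactly.
export GIT_AUTHOR_NAME="$(git log -1 --format=%an)"
export GIT_AUTHOR_EMAIL="$(git log -1 --format=%ae)"
export GIT_AUTHOR_DATE="@$(git log -1 --format=%ad --date=raw)"
export GIT_COMMITTER_NAME="$(git log -1 --format=%cn)"
export GIT_COMMITTER_EMAIL="$(git log -1 --format=%ce)"
export GIT_COMMITTER_DATE="@$(git log -1 --format=%cd --date=raw)"

# Feed the original message (everything after the first blank line of the
# raw commit object) to commit-tree. There is no parent: this is a root commit.
again=$(git cat-file commit "$orig" | sed '1,/^$/d' | git commit-tree "$tree")

echo "original: $orig"
echo "rebuilt:  $again"
```

Changing any one input (the tree, a timestamp, a single character of the message) would produce a different ID, which is why a filter that alters a commit's tree forces every descendant commit to be a real copy.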

What this means for you

On your original repo, you ran a git filter-branch command that copied commits to new, modified commits (with different trees and hence, at some point, different true names, which led to different parent IDs in subsequent commits, and so on). However, if you copy those copied commits by applying a filter that this time does nothing, then the new tree objects will be the same as the old ones. If the new parent is the same, and the author, committer, and message also all remain the same, the new commit ID for the copy will be the same as the old ID.

That is, these new copies are not copies after all, they're just the originals again!

Any other commits (those that were not copied in the first pass) do get copied, and hence have different IDs.

Here's where things get tricky.

If your current repository looks like this (graphically speaking):

R--o--o---o--o   <-- xxx [needs a name so that filter-branch will process it]
    \    /
     o--o        <-- feature

R'-o'-o'--o'-o'  <-- master
    \    /
     o'-o'

and we apply a new filter-branch to all references (or even "all but master") in such a way that it generates the same trees this time, it will copy R again and the new tree will match that for R', so the copy will actually be R'. Then it will copy the first post-R node, make the same changes, and the copy will actually be the first post-R' node, the first o'. This will repeat for all nodes, possibly even including R' and all the o' nodes. If filter-branch copies R', the resulting copy will just be R' again, though, because "remove nonexistent directory" makes no change: our filter does nothing to these particular commits.

Finally, filter-branch will move the labels, leaving the refs/original/ versions behind:

R--o--o---o--o   <-- refs/original/refs/heads/xxx
    \    /
     o--o        <-- refs/original/refs/heads/feature

R'-o'-o'--o'-o'  <-- master, xxx
    \    /
     o'-o'       <-- feature

This is, in fact, the desired outcome. 事实上,这是理想的结果。

What if the repository looks more like this? That is, what if there is no xxx or similar label pointing to the original (pre-filtering) master, so that you have this:

R--o
    \
     o--o        <-- feature

R'-o'-o'--o'-o'  <-- master
    \    /
     o'-o'

The filter-branch script will still copy R and the result will still be R'. Then it will copy the first o node and the result will still be the first o' node, and so on. It won't copy the now-deleted nodes, but it won't have to: we already have those, reachable via the branch name master. As before, filter-branch may copy R' and the various o' nodes, but this is OK, as the filter will do nothing, so the copies are really just the originals after all.

Last, filter-branch will, as usual, update the references:

R--o
    \
     o--o        <-- refs/original/refs/heads/feature

R'-o'-o'--o'-o'  <-- master
    \    /
     o'-o'       <-- feature

The key that makes this all work is that the filter leaves already-modified commits untouched, so that their second "copies" are just the first copies again. [1]
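The recovery scenario can be tried out on a scratch repository before touching the real one. In this sketch (repository layout, file names, and the `Example` identity are all hypothetical), the first pass filters only HEAD and the backup reference is then deleted, mimicking the situation described; the second pass repeats the same filter over every reference.

```shell
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1     # skip the deprecation pause in newer git
repo=$(mktemp -d); cd "$repo"              # hypothetical scratch repository
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"; git config user.email "example@example.invalid"

mkdir -p folder/subfolder
echo big > folder/subfolder/data; echo hi > README
git add .; git commit -q -m "R"
git checkout -q -b feature
echo f > feature.txt; git add feature.txt; git commit -q -m "feature work"
git checkout -q master
echo m > master.txt; git add master.txt; git commit -q -m "master work"

filter='git rm -r -q --cached --ignore-unmatch folder/subfolder'

# First pass: only HEAD (master), then throw away the backup ref.
git filter-branch -f --index-filter "$filter" HEAD >/dev/null 2>&1
git update-ref -d refs/original/refs/heads/master
first_pass_master=$(git rev-parse master)

# Second pass: the same filter, applied to every ref.
git filter-branch -f --index-filter "$filter" \
    --tag-name-filter cat -- --all >/dev/null 2>&1

second_pass_master=$(git rev-parse master)
root_master=$(git rev-list --max-parents=0 master)
root_feature=$(git rev-list --max-parents=0 feature)

echo "master after pass 1: $first_pass_master"
echo "master after pass 2: $second_pass_master"
echo "root of master:      $root_master"
echo "root of feature:     $root_feature"
```

If the filter is repeatable, master keeps the ID it had after the first pass, and feature ends up sharing master's rewritten root commit R'.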

Once everything is done, you can do the same shrinking described in the git filter-branch documentation to ditch the refs/original/ names and garbage-collect the now-unreferenced objects.
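A sketch of that shrinking sequence follows, run in a scratch repository so the effect is visible (the repository contents and the `Example` identity are hypothetical). It deletes the refs/original/ names, expires the reflogs immediately, and prunes unreachable objects regardless of age. Note the explicit `--expire=now` and `--prune=now`: without them, git keeps reflog entries and recent loose objects for a grace period, which may be why the commands in the question appeared to have no effect.

```shell
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1     # skip the deprecation pause in newer git
repo=$(mktemp -d); cd "$repo"              # hypothetical scratch repository
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"; git config user.email "example@example.invalid"

mkdir -p folder/subfolder
echo big > folder/subfolder/data; echo hi > README
git add .; git commit -q -m "root commit"
big_blob=$(git rev-parse HEAD:folder/subfolder/data)

git filter-branch -f \
    --index-filter 'git rm -r -q --cached --ignore-unmatch folder/subfolder' \
    --tag-name-filter cat -- --all >/dev/null 2>&1

# 1. Drop the backup refs that filter-branch left behind.
git for-each-ref --format='%(refname)' refs/original/ |
    while read -r ref; do git update-ref -d "$ref"; done
# 2. Expire all reflog entries right away.
git reflog expire --expire=now --all
# 3. Repack and delete unreachable objects regardless of their age.
git gc -q --prune=now

# Check whether the removed file's blob is still in the object database.
if git cat-file -e "$big_blob" 2>/dev/null; then
    result=present
else
    result=removed
fi
echo "big blob is $result"
```

This only shrinks the local repository. The copy on Bitbucket is a separate object store, and whether a force-push alone reclaims space there depends on the hosting side's own garbage collection.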


[1] If you had been using a filter that is not as easily repeated (e.g., one that makes new commits with "the current time" as their time-stamps), you would really need an untouched original repository, or those refs/original/ references (either one would suffice to keep an "original copy" around).
