
Is it possible to clean up a remote repo's files with bad commits on GitHub?

Background: I've got something of a nested issue for one of our repos that is remotely hosted on an Enterprise edition of GitHub that my company uses.

I think the easiest way to handle it, given how old the repo is, would be to remove old files that never should have been committed in the first place and that are presumably still being stored somewhere, either directly or by reference. The tricky part is that I don't want to mess with the history if it can be helped, and I don't know much about the more advanced git features, so it's hard to even know what the right question to ask is.

The problem: The repo is taking too long to pull/fetch down via Jenkins, via the GitSCM plugin. It times out after about 10 minutes. This repo has thousands of commits and dozens of tags to keep track of, so I can't arbitrarily set a certain commit as a good point to start from and truncate the rest.

My findings: Doing by hand what the GitSCM plugin seems to be doing doesn't cause nearly the same problems or take nearly as long. That said, it is still incredibly slow, just not 10+-minutes slow, so we should probably clean this up even if the plugin is making the performance worse.

Possible optimizations: I found out that several commits consisted mostly of added DLLs. These DLLs have since been removed by later commits. However, the repo is still hundreds of megabytes, far more than the files actually in use on the local filesystem. Right now, a checkout of the master branch is about 4 MB outside the .git folder, while the .git folder itself is about 300 MB.
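
To see what is actually filling those 300 MB before rewriting anything, one option is to list the largest blobs still reachable from any branch or tag. This is only a sketch: it assumes a Unix-like shell (Git Bash should do on Windows), and the function name largest_blobs is made up for illustration.

```shell
#!/usr/bin/env bash
# largest_blobs <repo-dir> [count]
# Prints "<size-in-bytes> <path>" for the biggest blobs reachable from any ref,
# largest first. Deleted files still show up here, because they live on in history.
largest_blobs() {
  local repo="$1" count="${2:-20}"
  git -C "$repo" rev-list --objects --all |
    git -C "$repo" cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    sed -n 's/^blob //p' |   # keep blobs only, drop the type column
    sort -rn |               # sort numerically on the size column, descending
    head -n "$count"
}
```

Running largest_blobs . 20 inside the clone prints the top 20 offenders. For a before/after comparison of the total, git count-objects -vH reports the packed size on its size-pack line.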

Goal: get rid of as much of that 300 MB as I can without pissing people off by losing history/tags

I've tried numerous solutions from possibly related questions, but I haven't been able to slim the remotely hosted repo down to something closer to the size actually used on the filesystem. Some of those questions were:

Reduce git repository size
How to remove unused objects from a git repository?
Why won't git further reduce the repository size?

After trying out solutions from those questions, I ended up only increasing the size of the repo instead of reducing it, which to be fair I was warned about in one of those questions' answers.

Given the background of this issue, the problem details, and the questions previously referenced, can one accomplish what I'm trying to do on a remote-hosted repo, and if so, what specifically should I run or ask our GHE admins to run if I'm not personally able to do the update?

This ended up causing it to grow:

git reflog expire --all --expire=now
git gc --prune=now --aggressive
git filter-branch --index-filter "git rm --cached --ignore-unmatch *.dll" --prune-empty -- --all
git push origin master

However, after running the first two commands, I only saw the .git folder reduce in size by 40 MB; nowhere near what I was hoping for, which is why I tried the next command in the sequence, which when pushing remotely caused the repo to grow instead of shrink. The object-count went from about 45k to 60k.

The tricky part is that I don't want to mess with the history if it can be helped,

But you will: git filter-branch or the (easier to use) BFG Repo-Cleaner rewrites the history (the SHA-1s) of that repo's commits, forcing you to git push --force the end result back to the remote repo.
This is not a big deal, considering the repo is old (i.e. not actively maintained anymore), but it still has to be taken into account.

The repo is taking too long to pull/fetch down via Jenkins, via the GitSCM plugin.

Jenkins should not be involved at all here: you can clone the repo locally, clean it, and push it back.
Plus, the timeout in Jenkins can be raised.

This ended up causing it to grow:

Those reflog/gc commands are supposed to be used after a filter-branch or BFG, not before.
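
A minimal sketch of the cleanup that belongs after the rewrite, assuming a Unix-like shell. The helper name prune_rewritten_repo is hypothetical. One detail that is easy to miss: git filter-branch keeps backups of the old refs under refs/original/, so the old objects stay reachable (and gc keeps them) until those refs are deleted.

```shell
#!/usr/bin/env bash
# prune_rewritten_repo <repo-dir>
# Run this only AFTER git filter-branch or BFG has finished rewriting history.
prune_rewritten_repo() {
  local repo="$1"
  # Drop the backup refs that filter-branch leaves under refs/original/
  # (a no-op if there are none, e.g. after BFG).
  git -C "$repo" for-each-ref --format='delete %(refname)' refs/original/ |
    git -C "$repo" update-ref --stdin
  # Forget the reflog entries that still point at the old commits.
  git -C "$repo" reflog expire --expire=now --all
  # Now the old objects are unreachable and can actually be pruned.
  git -C "$repo" gc --prune=now --aggressive
}
```

This only shrinks the local clone. The remote is a separate matter: pushing just a rewritten master adds the new commits, while tags and other branches on the server may still point at the old ones, which is one plausible reason the remote grew instead of shrinking.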

I am not going to accept my own answer. VonC has done an admirable task of trying to massage the answer within comments to meet my very specific requirements which may not be holding other people back with similar issues -- in addition, VonC did mention using BFG, which is what ended up unblocking me. Getting this to work with only git would have been nice, but since BFG is totally free (and also way faster than git filter-branch ) I can't disregard it as an alternative to handling git issues.

To unblock our remote builds by reducing the size of the .git folder, I used the free tool BFG Repo Cleaner and followed its instructions exactly. It shrank the .git folder from the original 300 MB down to 80 MB. Considering this repo had over 7k commits, I'm not going to complain about the .git folder still being large. This operation substantially sped up cloning the repo.

How-To

Full disclosure: some of these steps are copied directly from BFG Repo Cleaner's documentation, which is linked in step #2. The steps also assume you're using Windows, so update the shell syntax as needed.

  1. Install Java if you don't already have it
  2. Grab the free tool, BFG Repo Cleaner, from their site, here, which is also their documentation page
  3. If you don't want to do the exact same operation as me, where I delete all .dll files, check BFG's brief documentation to see what else is available
  4. Open a command console and make a bare mirror clone of your repo using --mirror (this is a full copy of all refs, not a shallow clone), as such:
    git clone --mirror https://github.com/some-big-repo.git
  5. If java.exe isn't in your PATH, either temporarily add its directory with set PATH=%PATH%;C:\PathToJavaBin or call it by its full path. Adjust the JAR file's name so the command below matches what is in your filesystem, as such:
    C:\PathToJavaBin\java.exe -jar C:\PathToBFGJar\bfg.jar --delete-files *.dll some-big-repo.git
  6. run cd some-big-repo.git
  7. run git reflog expire --expire=now --all
  8. run git gc --prune=now --aggressive
  9. run git push

And that was it :)
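
Before the final push, it may be worth checking that the rewrite did what was intended. A rough check, again assuming a Unix-like shell and a hypothetical helper name, is to count the objects in history whose path ends in .dll. BFG leaves the contents of the latest commit untouched by default, so a non-zero result is expected if the current HEAD still contains DLLs.

```shell
#!/usr/bin/env bash
# dll_objects_in_history <repo-dir>
# Prints how many objects reachable from any ref have a path ending in .dll
# (case-insensitive). Works on bare/mirror clones as well as normal ones.
dll_objects_in_history() {
  local repo="$1"
  # grep -c exits non-zero when the count is 0, hence the "|| true".
  git -C "$repo" rev-list --objects --all | grep -ci '\.dll$' || true
}
```

For example, dll_objects_in_history some-big-repo.git should print a much smaller number after the BFG and gc steps than before them.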
