What is the difference between cloning a git repo and doing a pull on an empty repo; in regards to files copied from the remote?

I know others have asked a similar question, but I am looking for gritty details here. I am very familiar with git, so I am not looking for why I would clone vs pull, or how to use different git workflows. I am more interested in the underlying plumbing.

The Scenario

I have inherited a repo that is enormous (400MB) and have been tasked with cleaning it up.

I have used git filter-branch and git gc --aggressive to remove all references to any large files that don't need to be in the repo. These commands have worked swimmingly. However, there is some interesting behavior that I would like to understand.

The Problems

  • The local repo size hasn't changed much, usually dropping by only about 50MB (even after removing an installer that was 300MB). So perhaps git is holding onto some history somewhere? Doing a little investigating, I found out that my local repo does have some left-over objects. That's fine, I will just push the new small repo to the remote, then clone a new one with the smaller set of changes.
  • After a git push --all --force <remote>, I blew everything away and did a fresh git clone <remote>, hoping that only the filtered history would get pulled down. Unfortunately, the size was still around 300MB, and git ls-tree still lists blobs for the large files that I had removed.
  • My last try was to blow away my local repo again, create an empty one (git init), then git remote add <url>, then git pull <remote>. Note: this is the same remote as above, the one with the filter-branch results pushed up to it. This time, the pull only brought down the small set of changes. The entire repo was much smaller, around 37MB! That's what I was looking for!

So as you can see, git clone is doing a lot more than just pulling down code changes on master and setting up remote tracking branches. What more is it doing? Why is it doing it? And how do I completely clean up my remote so that a clone results in the smaller size?

You are already on the right track there.

The difference is in how much of the remote each of those two operations transfers. Both commands transfer Git objects (commits, trees, and blobs) together with their history; neither copies only the working files. Git pull is git fetch followed by git merge, and fetch downloads only the objects reachable from the refs it fetches, plus any tags that point into that history. Git clone, by default, builds a whole new repository: it creates a remote named "origin" that refers to the original repository, fetches every branch and every tag the remote advertises, sets up remote-tracking branches, and checks out the remote's default branch. Keep in mind that git push --all pushes branches only, not tags, and that git filter-branch does not rewrite tags unless you pass --tag-name-filter. So if the remote still has tags (or other refs) pointing at the old history, a clone downloads that history along with the large blobs, while a pull of the rewritten branch does not. That would most likely explain the 300MB clone versus the 37MB pull.
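To see what each command actually downloads, here is a minimal sketch using a throwaway repository in a temporary directory. It shows a clone picking up a tag that still points at old history, while a pull of a single branch into an empty repository does not. The names src, v1.0, installer.bin, via-clone and via-pull are made up for illustration, and an orphan branch stands in for the history rewritten by filter-branch. A file:// URL is used so that Git goes through its normal transport rather than the local-path shortcut, which copies or hard-links the object store.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
# Helper so commits work even without a configured identity (assumption: demo values).
git_c() { git -c user.name=demo -c user.email=demo@example.com "$@"; }

# Build the "remote": old history with a large file, tagged v1.0.
git init -q src
cd src
git symbolic-ref HEAD refs/heads/old
echo "pretend this is a 300MB installer" > installer.bin
git add installer.bin
git_c commit -q -m "old history with installer"
git tag v1.0

# Stand-in for the rewritten history: an unrelated branch without the file.
git checkout -q --orphan main
git rm -q -rf .
echo "code" > app.txt
git add app.txt
git_c commit -q -m "rewritten history"
git branch -q -D old          # the branch is gone, but the tag still exists
cd "$work"

# What the remote advertises: main, HEAD, and the leftover tag.
git ls-remote "file://$work/src"

# Clone: fetches all branches and all tags.
git clone -q "file://$work/src" via-clone

# Empty repo + pull of one branch: only history reachable from main.
git init -q via-pull
git -C via-pull pull -q "file://$work/src" main

clone_tags=$(git -C via-clone tag -l)
pull_tags=$(git -C via-pull tag -l)
clone_objects=$(git -C via-clone rev-list --all --objects | wc -l)
pull_objects=$(git -C via-pull rev-list --all --objects | wc -l)
echo "clone: tags='$clone_tags' objects=$clone_objects"
echo "pull:  tags='$pull_tags' objects=$pull_objects"
```

In this sketch the clone ends up with the v1.0 tag and therefore with installer.bin in its object store, while the pulled repository contains only the rewritten history. Running git ls-remote against the real remote is a quick way to check for refs that still point at the unfiltered commits.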

If you want to start from scratch, I would suggest that you save your files and code, then delete the remote repository. Then create a new one, create the necessary branches, upload all the files and clone the repo. Note that this discards the existing history. Otherwise, if you wish to keep the repo, make sure to delete unnecessary branches and tags on the remote, merge what can be merged, and remove the backup refs (refs/original/) and reflog entries that filter-branch leaves behind locally, since these keep the old objects alive. The following link shows different ways of efficiently decreasing the size of a repository.

https://docs.gitlab.com/ee/user/project/repository/reducing_the_repo_size_using_git.html
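For the local side, a minimal sketch of why git gc alone does not shrink the repository after a rewrite, and what has to be dropped first. It runs in a throwaway repository; installer.bin, app.txt and the backup ref name are made up for illustration, and an amended commit plus a manually created ref under refs/original/ stand in for what filter-branch would leave behind.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .
# Helper so commits work even without a configured identity (assumption: demo values).
git_c() { git -c user.name=demo -c user.email=demo@example.com "$@"; }

echo "pretend this is a 300MB installer" > installer.bin
git add installer.bin
git_c commit -q -m "add installer"
old=$(git rev-parse HEAD)
blob=$(git rev-parse HEAD:installer.bin)

# Stand-in for the rewrite: replace the commit with one that lacks the file,
# and keep a backup ref similar to what filter-branch stores under refs/original/.
git rm -q installer.bin
echo "code" > app.txt
git add app.txt
git_c commit -q --amend -m "history without installer"
git update-ref refs/original/refs/heads/backup "$old"

# gc alone keeps the blob: the backup ref and the reflog still reach it.
git gc -q --aggressive --prune=now
before=$(git cat-file -t "$blob" 2>/dev/null || echo missing)

# Drop the backup refs, expire the reflog, then prune.
git for-each-ref --format='%(refname)' refs/original/ | xargs -n 1 git update-ref -d
git reflog expire --expire=now --all
git gc -q --prune=now
after=$(git cat-file -t "$blob" 2>/dev/null || echo missing)

echo "before cleanup: $before, after cleanup: $after"
```

The same reasoning applies to the remote: old objects stay as long as any ref there (for example a tag that was never rewritten or force-pushed) still points at them, and a hosted remote may additionally need its own garbage collection, which the linked page covers.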

I hope this helps. Sven


 