
Difference in size of a local repo and bitbucket

I have a local repo. I checked the size of the .git folder with the du -csh <foldername> command: it is 168 MB. I pushed it to my Bitbucket repo, and the size of the repo offered for download is just 134 MB.

How is this possible?

First, let's address whole repository size. (Jump to the second header section to skip this part.)

In general, "pure server" repositories are what Git calls bare repositories, which are repositories with no work-tree.

Remember that in any Git repository,[1] you have:

  • every commit ever made, stored in a Git-only form, plus
  • every file associated with these commits, also stored in a Git-only form, plus
  • some miscellaneous overhead data (tags, trees, reference names, "info", hooks, et cetera).

None of these[2] have the form of "files you work with normally on your computer", so if you ever plan to do anything with a commit, other than send it to another Git, you probably need a work-tree. The --bare repositories on the servers mostly just transfer commits to other Gits (receive and send), so it's a waste of space, and actually counterproductive, to keep a working copy of a current commit.

Since servers omit the working copy, you should generally expect server-side bare repositories to be smaller than client-side non-bare repositories. Your observed result should therefore be entirely unsurprising. What is surprising is that sometimes the server-side repository is bigger. There are many possible reasons for this, the most likely being that the compaction / garbage-collection code has not yet run on the server. In some cases you may need to get help from whoever runs the server. See, e.g., "How to reduce git repo size on Bitbucket?" The complete details rapidly get deep into the weeds of pack file formats, delta compression windows, alternate object directories (sites like GitHub make extensive use of the latter to keep forks from taking much space), and so on.
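To see how much of a non-bare repository is work-tree and how much is the repository proper, something like the following Python sketch can help. The function names dir_size and repo_breakdown are my own, and it sums apparent file sizes rather than disk blocks, so the numbers will not match du exactly.

```python
import os

def dir_size(path, skip=()):
    """Sum apparent file sizes under path, skipping directory names in `skip`."""
    total = 0
    for root, dirs, files in os.walk(path):
        # Prune skipped directories in place so os.walk does not descend into them.
        dirs[:] = [d for d in dirs if d not in skip]
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total

def repo_breakdown(repo):
    """Split a non-bare repository into its .git directory and its work-tree."""
    return {
        "git_dir": dir_size(os.path.join(repo, ".git")),
        "work_tree": dir_size(repo, skip=(".git",)),
    }
```

Calling repo_breakdown("/path/to/repo") on a clone returns both byte counts; a bare server-side repository only ever has the first part.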


[1] This deliberately ignores shallow or single-branch clones, which truncate history at specified places and hence omit some or many commits and files.

[2] Actually, many of Git's internal files are plain-text, but many are not, and in any case you should generally use what Git calls the plumbing commands to manipulate them, if you are going to write your own code to work with Git. Using the provided API (the plumbing commands) insulates you from future changes intended to make Git work better, faster, etc.


But your .git is just the bare repository

You compared your .git directory to their download. Neither of these is or has a work-tree, so why was your .git directory bigger?

Now we have to get into those weeds, at least a little bit.

The first thing to know is that Git has two forms for every Git object: each commit, each tree, each "blob" (file), and each annotated tag. One of these is a loose format, which is merely zlib-deflated. The other is in a pack file, which is more compressed: packed objects are deflated too, and similar objects can additionally be stored as deltas against one another.
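As a rough illustration of the loose format, here is a minimal Python sketch that builds the bytes of a loose blob by hand. It assumes a SHA-1 repository (the default), and the helper name loose_object is mine, not anything Git provides. In real code you would use the plumbing command git hash-object instead.

```python
import hashlib
import zlib

def loose_object(content: bytes, kind: str = "blob"):
    """Return (object id, deflated bytes) for a loose object, assuming SHA-1."""
    # A loose object is "<type> <size>\0" followed by the raw content.
    header = f"{kind} {len(content)}".encode("ascii") + b"\x00"
    store = header + content
    oid = hashlib.sha1(store).hexdigest()
    # Only zlib deflation is applied; there is no delta compression here.
    return oid, zlib.compress(store)

oid, data = loose_object(b"hello world\n")
# Git would store these bytes at .git/objects/<first 2 hex digits>/<remaining 38>.
print(oid[:2] + "/" + oid[2:], len(data), "bytes on disk")
```

Because each loose object is compressed on its own, two nearly identical versions of a large file cost roughly twice the space until they are packed.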

When you work in a Git repository, you create new loose objects. Git eventually decides that there are too many loose objects taking up too much space, and packs them. That makes them slower to retrieve—they have to be found and unpacked, instead of just gathered directly and re-inflated—but now they take less space.
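To check how much of a given .git directory is still loose and how much is already packed, a small Python sketch along these lines should do. The name count_objects and the dictionary keys are my own; the real tool for this is git count-objects -v, which I would trust over this approximation.

```python
import os
import re

def count_objects(git_dir):
    """Roughly tally loose objects and pack files under <git_dir>/objects."""
    stats = {"loose_count": 0, "loose_bytes": 0, "pack_count": 0, "pack_bytes": 0}
    objects = os.path.join(git_dir, "objects")
    if not os.path.isdir(objects):
        return stats
    for entry in os.listdir(objects):
        sub = os.path.join(objects, entry)
        # Loose objects live in 256 fan-out directories named 00 through ff.
        if re.fullmatch(r"[0-9a-f]{2}", entry) and os.path.isdir(sub):
            for name in os.listdir(sub):
                if re.fullmatch(r"[0-9a-f]+", name):  # ignore temporary files
                    stats["loose_count"] += 1
                    stats["loose_bytes"] += os.path.getsize(os.path.join(sub, name))
    pack_dir = os.path.join(objects, "pack")
    if os.path.isdir(pack_dir):
        for name in os.listdir(pack_dir):
            if name.endswith(".pack"):
                stats["pack_count"] += 1
            # Count the pack itself plus its index toward the packed size.
            if name.endswith((".pack", ".idx")):
                stats["pack_bytes"] += os.path.getsize(os.path.join(pack_dir, name))
    return stats
```

A freshly cloned repository is usually almost entirely packed, while one you have been working in for a while tends to show a noticeable loose_bytes figure.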

Second, every time you do stuff in Git, you add new objects. Some of these never get saved permanently. These (usually loose) objects are what Git calls unreferenced: they were made with the intent to save them, at least for a while and maybe permanently, but then they turned out to be unnecessary, so they have just been dropped on the floor.

Besides this, every time you rebase commits, you are actually copying them, then abandoning the originals. But Git keeps the originals around for at least 30 days by default, in case you change your mind and want them back. It uses Git's reflogs to do this.

This is where Git's "garbage collector", git gc, comes in. The garbage collector (the Grim Reaper of Git, or the Grim Collector perhaps) has a number of jobs, including to figure out what's aged out of the reflogs and should be thrown out. This may make more objects become unreferenced (in addition to any that were created but then turned out to be unneeded after all), so it next finds unreferenced loose objects and "prunes" them. Last, it takes care of packing loose objects into the smaller (but slower to access) pack file format.

The garbage collector gets run automatically for you when needed; you should not ever have to run it manually. If you do have to run it manually, that's indicative of a bug of sorts in Git (I have read of some cases of this, with scripts that overload the loose object auto-pruning). Note that this normally leaves unreferenced loose objects around for at least 14 days, though, in case something is still working on making them referenced.
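For a feel of what "when needed" means, here is a Python rendering of the loose-object check as I understand it: rather than counting every loose object, Git samples the single directory objects/17 and compares that count against the gc.auto setting (default 6700) divided by 256. Treat this as a sketch of the idea rather than Git's exact logic, which also considers the number of pack files and may differ between versions; the function name needs_auto_gc is mine.

```python
import os
import re

def needs_auto_gc(git_dir, gc_auto=6700):
    """Estimate whether loose objects exceed gc.auto by sampling objects/17."""
    if gc_auto <= 0:
        return False  # gc.auto = 0 turns the automatic check off
    sample_dir = os.path.join(git_dir, "objects", "17")
    if not os.path.isdir(sample_dir):
        return False
    # Object ids are evenly spread, so one of the 256 directories is a fair sample.
    per_dir_limit = (gc_auto + 255) // 256  # ceil(gc_auto / 256)
    loose = sum(1 for name in os.listdir(sample_dir)
                if re.fullmatch(r"[0-9a-f]+", name))
    return loose > per_dir_limit
```

With the default setting the per-directory limit works out to 27, so the check trips once objects/17 holds 28 or more loose objects.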

[Edit to add last two items I should have mentioned earlier:] Servers generally run git gc to pack and clean up after each push; and the downloadable version is sometimes repackaged on the spot to make it as small as possible, or at least, as small as "automatically possible" (sometimes you can make pack files even smaller by tuning the gc parameters, although when I first experimented with Git I kept making them bigger instead :-) ).
