
How to mirror a git repository safely?

I want to mirror some git repositories with a background job. git clone --mirror and git remote update won't preserve objects that were unreferenced with a forced push, but I want to keep those too in case of a hack. Are there any tools to perform safe git mirrors?

This is tedious without a shell script, but if you have access to the remote repository you could run:

git fsck --lost-found

This will list the unreferenced commits and you can then create a branch for each one:

git branch <branch-for-commit> <commit>

At this point nothing is unreferenced, and the clone won't gc or lose anything. Afterwards, if you choose, you can delete the branches as described in Delete Branches.

Like this:

git fsck --lost-found | \
  grep dangling | \
  awk '{ system ("git branch" " br-" $3 " " $3); }'

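to retain dangling commits (git branch accepts only commits, so the lines reported for dangling blobs and trees will fail with an error and those objects stay unreferenced). Then, after the clone,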

git branch -a | grep "br-" | xargs git branch -D

After re-reading the question and the provided answers I'm sure I misinterpreted the poster's intent.

Now I think what's needed is enabling the reflog in the destination repository by setting core.logAllRefUpdates to true in it, and possibly tuning the associated parameters that control the reflog expiration policy (grep the git-config manual page for the word "reflog"). This logs all "drastic" ref repositionings, which makes it possible to roll back forced pushes and the like.
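A minimal sketch of that setup, assuming the destination is a bare mirror whose path is passed in as an argument. The function name enable_reflog is made up for illustration; the three keys are standard git-config settings, and the value "never" is just one possible expiration policy, not something the answer prescribes.

```shell
#!/bin/sh
# enable_reflog is a hypothetical helper; pass it the path of the mirror.
enable_reflog() {
    repo="$1"
    # Bare repositories do not record reflogs by default, so turn them on.
    # (Newer Git versions also accept "always", which logs every ref under
    # refs/, not only branches.)
    git -C "$repo" config core.logAllRefUpdates true || return 1
    # Keep reflog entries indefinitely, including entries whose commits are
    # no longer reachable from the ref (for example after a forced push).
    git -C "$repo" config gc.reflogExpire never || return 1
    git -C "$repo" config gc.reflogExpireUnreachable never || return 1
}
```

After a forced update arrives, the previous tip of a branch should then still be listed by git reflog show <branch> inside the mirror.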

Note that the only true protection against, as you say, "hacking", or any other kind of loss, is still having secured offsite backups, so what I proposed in my first answer, I think, still holds.

This is very much a security issue. If you accidentally pushed a password or something similar, you can amend your commit and force-push the new one. Afterwards Git will not give clients any access to the first commit.

If you really need those commits, you need access to the file system and must use something like rsync to fetch everything. But be aware that Git runs garbage collection from time to time, so old unreferenced commits do eventually get lost.

I think you could employ two-phase backups:

  1. git fetch <remote> +refs/*:refs/* to update a bare mirror repository.

    git gc to possibly "normalize" its contents.

  2. rdiff-backup the result to another directory, which will contain the last backed up version plus a collection of binary diff files which could be used to reconstruct any previous backup.

This way you get a versioned history of your backup snapshots. Since rdiff-backup lets you discard old snapshots, you can store only as many snapshots as you see fit.
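The two phases above could be wired together roughly as follows. This is only a sketch: the function name mirror_and_snapshot and its three arguments are hypothetical, and it assumes rdiff-backup is invoked in its classic "source destination" form. If rdiff-backup is not installed, the snapshot step is skipped with a warning.

```shell
#!/bin/sh
# mirror_and_snapshot is a hypothetical helper for the two-phase backup.
mirror_and_snapshot() {
    mirror="$1"     # path of the bare mirror repository
    remote="$2"     # remote name or URL to fetch from
    snapshots="$3"  # rdiff-backup destination directory

    # Phase 1: force-update every ref in the bare mirror, then repack it.
    git -C "$mirror" fetch --quiet "$remote" '+refs/*:refs/*' || return 1
    git -C "$mirror" gc --quiet || return 1

    # Phase 2: record the current state of the mirror as a new increment.
    if command -v rdiff-backup >/dev/null 2>&1; then
        rdiff-backup "$mirror" "$snapshots"
    else
        echo "rdiff-backup not found; skipping snapshot" >&2
    fi
}
```

A background job would call it as, for example, mirror_and_snapshot /srv/mirror.git origin /srv/snapshots (all three values are placeholders).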

The drawback is the waste of disk space:

  • rdiff-backup will make a true copy of the source directory; no file will be shared.
  • AFAIK, the Git internals that deal with pack files do not support anything like the --rsyncable command-line option of the gzip program, so the deltas generated by rdiff-backup might be big. On the other hand, in a typical Git repo any pack file should only ever be appended to, not rewritten, so I might be barking up the wrong tree here.

If you're fetching into a repo you control, there are gc config items to control every aspect of what gets removed and when; git remote update (and everything else) will preserve unreferenced objects as long as you like. Anyone with administrative access can publish unreferenced objects by tagging them.
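As a rough illustration of those gc settings, assuming the goal is to never discard anything in a repository you control: the helper names below are invented for this sketch, while gc.pruneExpire and gc.auto are standard git-config keys and git tag accepts any object, not only commits.

```shell
#!/bin/sh
# Hypothetical helper: stop gc from ever discarding unreachable objects.
keep_unreferenced_objects() {
    repo="$1"
    # Never prune unreachable objects when gc runs.
    git -C "$repo" config gc.pruneExpire never || return 1
    # Disable the automatic gc that Git may trigger after some commands.
    git -C "$repo" config gc.auto 0 || return 1
}

# Hypothetical helper: make an unreferenced object reachable (and therefore
# fetchable) by pointing a lightweight tag at it.
publish_object() {
    repo="$1"; name="$2"; object="$3"
    git -C "$repo" tag "$name" "$object"
}
```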

So to mirror a repo safely, if you have administrative control over it just set up a hook to forward incoming pushes to a backup repo that never expires. Otherwise, keying off receipt of a commit notice will ensure you'll remember everything you ever could have seen.
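One way such a forwarding hook might look is sketched below. It is meant to be the body of a post-receive hook in the repository being protected; Git passes that hook one line per updated ref on standard input, in the form "<old-id> <new-id> <ref-name>". The function name forward_pushes, the refs/archive/ namespace and the backup location are all assumptions made for illustration.

```shell
#!/bin/sh
# forward_pushes is a hypothetical helper; "$1" is the backup repository
# (a remote name or URL). It reads post-receive input from stdin.
forward_pushes() {
    backup="$1"
    while read -r old new ref; do
        # An all-zero new id means the ref was deleted: nothing to forward.
        case "$new" in
            *[!0]*) ;;
            *) continue ;;
        esac
        # Archive every pushed tip under its own ref, named after the
        # original ref plus the object id, so a later forced push can
        # never overwrite it in the backup.
        git push --quiet "$backup" \
            "$new:refs/archive/${ref#refs/}/$new" </dev/null || return 1
    done
}
```

The hook file itself would then only need to call forward_pushes with the backup location, and the backup repository would be configured so that gc never expires anything.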

Methinks one of the commenters has it right: rsync or equivalent is your only friend, irrespective of the fact that it's not an option for e.g. GitHub.

See this (admittedly old) thread:

http://kerneltrap.org/mailarchive/git/2007/8/28/256180/thread

Quoting one of the messages:

It has nothing to do with Linux development practices. There are fundamental reasons why we don't fetch hooks:

  • they are not part of the repository; just look at the .gitattributes-in-the-index-but-not-worktree issue to find out why,

  • it is private data, just like the config. The client has no business to read them, let alone fetch them,

  • if you have the hooks on different machines, chances are that you need a mechanism to update the hooks... This naturally suggests putting the hooks into their own branch.

Probably there are way more reasons not to allow such a thing as fetching hooks.

So... Putting the hooks in their own repo is a good solution if you want to deploy them in multiple locations, but besides that there isn't much you can do insofar as I'm aware for all sorts of objects in a repo...

(credit for that link goes to: https://stackoverflow.com/a/6154010/417194 )

For GitHub in particular, there might be workarounds with github-services or web hooks.

