Git fetch single file from remote repository programmatically

I'll say up front that this question is similar in nature to this one. There's one key difference that makes it unique: I want to use the raw git protocol (see here and here if you're unfamiliar with the basic pack network protocol).

I'm writing an application using Scala and JGit that will connect to an anonymous git repository. I want to request a single blob (think "/path/to/file.txt" @ "refs/heads/branch1"). Ultimately my goal is to programmatically retrieve a single file out of a remote repository. Seems like a pretty useful thing to be able to do.

Anywho, I've been delving into the internals of this protocol. It appears that the basic version of this is "I want these object(s), I have these object(s)" -- and bam, there's a packfile with everything you don't have. The core of my question is this: how do I ask git-upload-pack for a single object in a non-recursive manner? I'm ok with downloading a single commit object, then asking for the tree, then a subtree, then another subtree, and then finally the blob itself. Speed isn't too important here; mainly I'm trying to save on bandwidth. But it seems that there's simply no way to tell git-upload-pack, "please only give me the one object I asked for".

Yes, there's the "have" list, which will basically exclude objects from coming down; however, that requires a priori knowledge of the contents of a repository (I don't have a local repo, remember). I could generate a list of all possible sha1s and send all of them except for the one I want, but that's beyond ridiculous (time consuming, bandwidth consuming, and a crime against programmers everywhere).

Another possible solution I've been delving into is using git-upload-archive on the remote side instead, although I admit I haven't spent much time looking into it yet.

I'm more than willing to rewrite JGit if it comes to that, so please don't read this as "how do I make JGit do...". I just want to know if the protocol itself is even capable of this. I feel like there's some wonderfully clever way to abuse the protocol to achieve what I want. Any thoughts?

Answering my own question. I found an acceptable (although barely documented) answer. I had to dig through a LOT of C code to figure this out.

First of all, the above requirements can't be achieved using git-upload-pack because that's simply not what the program was designed to do. The correct answer, as I suspected, is git-upload-archive. Sadly the protocol is hardly documented at ALL. So here are my notes on it in case anyone else has similar requirements.

Basically what I'm trying to simulate here (in Scala) is the following command:

git archive --format=tar --remote=ssh://dave@ssh.mycompany.com/cornballer.git \
  master plans/documents/cornballer-blueprint.pdf | tar -x

Except in software, hopefully using JGit. Sadly JGit doesn't (yet) support git archive commands. So here's a very high-level overview of how to add support (I may fork JGit and add this at a later time).

Let's look at the protocol (from Documentation/technical/pack-protocol.txt):

git-proto-request = request-command SP pathname NUL [ host-parameter NUL ]
request-command   = "git-upload-pack" / "git-receive-pack" /
                    "git-upload-archive"   ; case sensitive
pathname          = *( %x01-ff ) ; exclude NUL
host-parameter    = "host=" hostname [ ":" port ]

So part one of the protocol goes something like this:

  1. Establish a transport with the remote (either ssh and then run git-upload-archive or use the anonymous git protocol)
  2. Send git-upload-archive /cornballer.git\0host=ssh.mycompany.com\0 (as a packet line)

At this point the connection is established. Git may return an error if the command isn't supported or if there was any kind of problem. I haven't yet figured out how to check for this.
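As a rough illustration of step 2, here is a minimal Python sketch of the packet-line framing and the request line from the grammar above (the same logic should port to Scala/JGit). The helper names `pkt_line` and `proto_request` are mine, not part of git or JGit. It assumes the standard pkt-line framing: four hex digits giving the total length, with the four length bytes themselves counted. As far as I can tell, this request line only matters for the anonymous git:// transport; over ssh the remote command is started by the ssh session itself.

```python
from typing import Optional

MAX_PKT_LEN = 65520  # largest pkt-line git accepts, length header included


def pkt_line(payload: bytes) -> bytes:
    """Frame payload as one pkt-line: 4 hex digits of total length, then the payload."""
    total = len(payload) + 4  # the 4-byte length header counts itself
    if total > MAX_PKT_LEN:
        raise ValueError("payload too large for a single pkt-line")
    return f"{total:04x}".encode("ascii") + payload


def proto_request(command: str, pathname: str, host: Optional[str] = None) -> bytes:
    """Build the git-proto-request: request-command SP pathname NUL [ host-parameter NUL ]."""
    body = command.encode("utf-8") + b" " + pathname.encode("utf-8") + b"\0"
    if host is not None:
        body += b"host=" + host.encode("utf-8") + b"\0"
    return pkt_line(body)


if __name__ == "__main__":
    # Hypothetical usage with the example repository from this answer.
    print(proto_request("git-upload-archive", "/cornballer.git", "ssh.mycompany.com"))
```

The resulting bytes are what would be written to the socket right after connecting.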

Next comes the undocumented part. We basically send command line arguments for git-archive over the wire. They're exactly the same as the git-archive command with one exception: they are all prefixed with argument[SPACE]. Each argument is written (at least in the reference implementation) as a separate packet line. So for the above example:

  1. Send argument --format=tar (as a packet line)
  2. Send argument master (as a packet line)
  3. Send argument plans/documents/cornballer-blueprint.pdf (as a packet line)
  4. Send a flush packet (0000)
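The four steps above can be sketched in Python as follows. `archive_request` is a hypothetical helper name. I'm assuming each argument line ends with a newline, which is what the reference implementation appears to send; treat that as an assumption rather than a documented requirement.

```python
from typing import Iterable

FLUSH_PKT = b"0000"  # a pkt-line with length 0 marks the end of the argument list


def pkt_line(payload: bytes) -> bytes:
    """Frame payload as one pkt-line (4 hex digits of total length + payload)."""
    return f"{len(payload) + 4:04x}".encode("ascii") + payload


def archive_request(args: Iterable[str]) -> bytes:
    """Encode git-archive arguments as 'argument <arg>' pkt-lines followed by a flush."""
    lines = [pkt_line(b"argument " + arg.encode("utf-8") + b"\n") for arg in args]
    return b"".join(lines) + FLUSH_PKT


if __name__ == "__main__":
    request = archive_request([
        "--format=tar",
        "master",
        "plans/documents/cornballer-blueprint.pdf",
    ])
    print(request)
```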

At this point we've given the remote git-archive process the entire command. Now we read the response. We read one packet line back from the server, which will be one of the following responses:

  1. ACK (meaning success -- ready to send the archive)
  2. NACK [message] -- some kind of error; I only found one instance of its use -- "unable to spawn subprocess"
  3. ERR [message] -- an error occurred
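A minimal Python sketch of reading that first response, assuming `stream` is any binary file-like object wrapping the connection. `read_pkt_line`, `expect_ack` and `ArchiveError` are illustrative names, not git or JGit APIs. On ACK it also consumes the flush packet that follows, as described in the next paragraph.

```python
import io
from typing import BinaryIO, Optional


class ArchiveError(Exception):
    """Raised when the remote answers with NACK, ERR, or something unexpected."""


def read_pkt_line(stream: BinaryIO) -> Optional[bytes]:
    """Read one pkt-line; return its payload, or None for a flush packet (0000)."""
    header = stream.read(4)
    if len(header) != 4:
        raise EOFError("connection closed while reading pkt-line length")
    total = int(header, 16)
    if total == 0:
        return None
    if total < 4:
        raise ArchiveError(f"invalid pkt-line length {total}")
    payload = stream.read(total - 4)
    if len(payload) != total - 4:
        raise EOFError("connection closed in the middle of a pkt-line")
    return payload


def expect_ack(stream: BinaryIO) -> None:
    """Consume the ACK and the flush packet after it, or raise ArchiveError."""
    line = read_pkt_line(stream)
    if line is None:
        raise ArchiveError("unexpected flush packet instead of ACK")
    text = line.decode("utf-8", "replace").rstrip("\n")
    if text == "ACK":
        if read_pkt_line(stream) is not None:
            raise ArchiveError("expected a flush packet after ACK")
        return
    if text.startswith("NACK") or text.startswith("ERR"):
        raise ArchiveError(text)
    raise ArchiveError(f"unexpected response: {text!r}")


if __name__ == "__main__":
    expect_ack(io.BytesIO(b"0008ACK\n0000"))  # succeeds silently
```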

If an ACK is sent, it will be followed by a flush packet (0000) and then the tar data. The data arrives as packet lines multiplexed over the side-band: the first byte of each packet is the band number, and the archive itself comes in on sideband #1 (the main data channel). You repeatedly read packet lines until you reach a flush packet, then you stop reading. Pretty simple.
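Here is a rough Python sketch of that read loop. It assumes git's usual side-band convention, where the first payload byte is the band number: 1 for data, 2 for progress text, 3 for a fatal error. The handling of bands 2 and 3 comes from that convention, not from anything I verified against git-upload-archive specifically. `read_archive` is a hypothetical helper name.

```python
import io
import sys
from typing import BinaryIO, Optional


def read_pkt_line(stream: BinaryIO) -> Optional[bytes]:
    """Read one pkt-line; return its payload, or None for a flush packet (0000)."""
    header = stream.read(4)
    if len(header) != 4:
        raise EOFError("connection closed while reading pkt-line length")
    total = int(header, 16)
    if total == 0:
        return None
    payload = stream.read(total - 4)
    if len(payload) != total - 4:
        raise EOFError("connection closed in the middle of a pkt-line")
    return payload


def read_archive(stream: BinaryIO, out: BinaryIO) -> None:
    """Copy band-1 data from side-band pkt-lines into out until a flush packet."""
    while True:
        pkt = read_pkt_line(stream)
        if pkt is None:  # flush packet: the archive is complete
            return
        if not pkt:  # empty packet carries no band byte; skip it
            continue
        band, data = pkt[0], pkt[1:]
        if band == 1:
            out.write(data)
        elif band == 2:  # progress messages (assumed convention)
            sys.stderr.write(data.decode("utf-8", "replace"))
        elif band == 3:  # fatal error reported by the remote (assumed convention)
            raise RuntimeError(data.decode("utf-8", "replace").strip())
        else:
            raise RuntimeError(f"unexpected side-band channel {band}")


if __name__ == "__main__":
    fake = io.BytesIO(b"0008\x01abc0008\x01def0000")
    tar_bytes = io.BytesIO()
    read_archive(fake, tar_bytes)
    print(len(tar_bytes.getvalue()), "bytes received")
```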

So now you have the remote file, but what if you wanted to do some kind of clever caching? One reason I was so gung-ho on using git-upload-pack is that it would let me record the commit ID and thus cache it locally and only refresh as needed. A tar file doesn't tell us that info, right? Wrong!

From the man page of git-archive:

Additionally the commit ID is stored in a global extended pax header if the tar format is used; it can be extracted using git get-tar-commit-id. In ZIP files it is stored as a file comment.

Well that's great news! That's literally everything I wanted. In case you're wondering what the header looks like, here's a sample (no I'm not going to dissect pax headers):

pax_global_header00006660000000000000000000000064121002672560014513gustar00rootroot0000000000000052 comment=326756f834865880c9832b64238e7665632e9b67
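If you'd rather not dissect the header by hand either, a tar library that understands pax global headers can pull the commit ID out for you. As a sketch in Python: the standard `tarfile` module exposes global pax headers as `TarFile.pax_headers`, so the `comment` key holds the commit ID. `tar_commit_id` and `read_member` are hypothetical helper names, and the sketch assumes the whole archive is already in memory as `tar_bytes`.

```python
import io
import tarfile
from typing import Optional


def tar_commit_id(tar_bytes: bytes) -> Optional[str]:
    """Return the commit ID from the pax global header 'comment', if present."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tf:
        tf.getmembers()  # make sure every header, including the global one, is parsed
        return tf.pax_headers.get("comment")


def read_member(tar_bytes: bytes, path: str) -> bytes:
    """Return the contents of one regular file stored in the archive."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tf:
        member = tf.extractfile(path)  # raises KeyError if the path is missing
        if member is None:
            raise ValueError(f"{path} is not a regular file")
        return member.read()
```

Comparing the returned ID against the last one you cached tells you whether the file needs refreshing.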

So from my perspective, I simply need to set up a pipeline that runs the above steps and then passes the result through an untar step (programmatically) to get the desired "fetch a single file from git" functionality.
