
Read .gz files inside .tar files without extracting

I have a .tar file that contains many .gz files inside a folder. Each of these .gz files contains a .txt file. Other Stack Overflow questions related to this problem are aimed at extracting the files.

I am trying to iteratively read the content of each .txt file without extracting them, because the .tar is large.

First I read the contents of the .tar file:

import tarfile
tar = tarfile.open("FILE.tar")
tar.getmembers()

Or in Unix:

tar xvf file.tar -O

Then I tried using the tarfile extractfile method, but I'm getting an error: "module 'tarfile' has no attribute 'extractfile'". Besides, I'm not even sure that is the right method.

import gzip
for member in tar.getmembers():
    m = tarfile.extractfile(member)
    file_contents = gzip.GzipFile(fileobj=m).read()

If you want to create an example file to simulate the original file:

$ mkdir directory
$ touch directory/file1.txt.gz directory/file2.txt.gz directory/file3.txt.gz
$ tar -c -f file.tar directory

Edit: I am writing a Python script, but a Unix command line would also work.

Here is a Unix command-line (bash) approach:

To prepare the files:

$ git clone https://github.com/githubtraining/hellogitworld.git
$ cd hellogitworld
$ gzip *
$ ls
build.gradle.gz  fix.txt.gz  pom.xml.gz  README.txt.gz  resources  runme.sh.gz  src
$ cd ..
$ tar -cf hellogitworld.tar hellogitworld/

Here is how to view its readme:

$ tar -Oxf hellogitworld.tar hellogitworld/README.txt.gz | zcat

Result:

This is a sample project students can use during Matthew's Git class.

Here is an addition by me

We can have a bit of fun with this repo, knowing that we can always reset it to a known good state.  We can apply labels, and branch, then add new code and merge it in to the master branch.

As a quick reminder, this came from one of three locations in either SSH, Git, or HTTPS format:

* git@github.com:matthewmccullough/hellogitworld.git
* git://github.com/matthewmccullough/hellogitworld.git
* https://matthewmccullough@github.com/matthewmccullough/hellogitworld.git

We can, as an example effort, even modify this README and change it as if it were source code for the purposes of the class.

This demo also includes an image with changes on a branch for examination of image diff on GitHub.

Note that I'm not associated with that Git repository.

Explanation for tar:

  • flag -x = extract
  • flag -O = don't write the file to filesystem but write to STDOUT
  • flag -f = specify a file

The rest is just piping the result to zcat, which writes the uncompressed plaintext to STDOUT.
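The same single-member read can be sketched in Python, for the case where the script should stay in one language. This is only a minimal sketch: the helper name read_gz_member and the UTF-8 default are assumptions, not something from the question. It relies on TarFile.extractfile accepting a member name and returning a file-like object, so nothing is written to disk.

```python
import gzip
import tarfile


def read_gz_member(tar_path, member_name, encoding="utf-8"):
    """Return the decompressed text of one .gz member of a tar archive.

    Hypothetical helper; the encoding default is an assumption.
    """
    with tarfile.open(tar_path) as tar:
        # extractfile gives a file-like object backed by the archive;
        # it raises KeyError if the member name is not in the archive.
        raw = tar.extractfile(member_name)
        if raw is None:
            # Directories and other non-file members have no content.
            raise ValueError(f"{member_name} is not a regular file")
        # gzip.open accepts an existing file object and decompresses on read.
        with gzip.open(raw, mode="rt", encoding=encoding) as text:
            return text.read()
```

With the archive prepared above, read_gz_member("hellogitworld.tar", "hellogitworld/README.txt.gz") should return the same text that the tar and zcat pipeline prints.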

You need to use tar.extractfile(member) instead of tarfile.extractfile(member). tarfile is the module, and it doesn't know about the tar file you opened. tar is the TarFile object returned by tarfile.open, which references the .tar file you opened.
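Putting that fix into the loop from the question, a minimal sketch might look like the following. The generator name iter_gz_texts, the ".gz" suffix check, and the UTF-8 default are assumptions added for illustration. Directory entries are skipped because extractfile returns None for them.

```python
import gzip
import tarfile


def iter_gz_texts(tar_path, encoding="utf-8"):
    """Yield (member name, decompressed text) for each .gz file in the archive.

    Hypothetical helper; the suffix filter and encoding are assumptions.
    """
    with tarfile.open(tar_path) as tar:
        # Iterating the TarFile object walks the members one by one.
        for member in tar:
            # Skip directories and anything that is not a .gz file.
            if not (member.isfile() and member.name.endswith(".gz")):
                continue
            # Call extractfile on the opened archive, not on the module.
            raw = tar.extractfile(member)
            with gzip.open(raw, mode="rt", encoding=encoding) as text:
                yield member.name, text.read()
```

Looping with "for name, text in iter_gz_texts("FILE.tar")" then handles one file at a time, so only the current member's text is held in memory and nothing is extracted to the filesystem.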
