How to read and list a tgz file in Python 3?

In Python 3 (3.6.8) I want to read a gzipped tar file and list its contents.

I found this solution which yields an error

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

Searching for this error, I found this suggestion, so I tried the following code snippet:

with open(out_file) as fd:
    gzip_fd = gzip.GzipFile(fileobj=fd)
    tar = tarfile.open(gzip_fd.read())

which yields the same error!

So how to do it right?

After looking at the actual documentation here, I came up with the following code:

tar = tarfile.open(out_file, "w:gz")
for member in tar.getnames():
   print(tar.extractfile(member).read())

which finally worked without errors - but did not print any content of the tar archive on the screen!

The tar file is well formatted and contains folders and files. (I need to try to share this file)

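When you open a file without specifying a mode, it defaults to reading it as text. You need to open the file as a raw byte stream using mode='rb', then feed it to the gzip reader. Note that tarfile.open expects a file name or a file object, so the decompressed stream should be passed via fileobj rather than as the bytes returned by read():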

import gzip
import tarfile

with open(out_file, mode='rb') as fd:
    gzip_fd = gzip.GzipFile(fileobj=fd)
    tar = tarfile.open(fileobj=gzip_fd, mode='r:')  # stream is already decompressed
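
The byte 0x8b in the error message is the second byte of the gzip magic number (0x1f 0x8b). It is not a valid UTF-8 start byte, so reading a gzip file as UTF-8 text fails at position 1. Here is a minimal sketch that reproduces this; the sample file and the helper names read_header and text_mode_fails are hypothetical and only serve the illustration:

```python
import gzip
import os
import tempfile


def read_header(path):
    """Return the first two bytes of a file opened in binary mode."""
    with open(path, mode="rb") as fd:
        return fd.read(2)


def text_mode_fails(path):
    """Return True if reading the file as UTF-8 text raises UnicodeDecodeError."""
    try:
        # The encoding is given explicitly so the result does not depend on the locale.
        with open(path, encoding="utf-8") as fd:
            fd.read()
    except UnicodeDecodeError:
        return True
    return False


# Build a small throwaway gzip file (hypothetical sample data).
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.gz")
with gzip.open(sample_path, "wb") as gz:
    gz.write(b"hello")

print(read_header(sample_path))      # the gzip magic number
print(text_mode_fails(sample_path))  # text mode cannot decode it
```

Opening in binary mode avoids the decoding step entirely, which is why mode='rb' makes the error go away.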

The python-archive module (installable with pip) could help you:

from archive import extract

file = "you/file.tgz"
try:
    extract(file, "out/%s.raw" % (file), ext=".tgz")
except Exception:
    # could not extract
    pass

Available extensions are (v0.2): '.zip', '.egg', '.jar', '.tar', '.tar.gz', '.tgz', '.tar.bz2', '.tz2'

More info: https://pypi.org/project/python-archive/

Not sure why it did not work before, but the following solution works for me to list the files and folders of a gzipped tar archive with Python 3.6:

tar = tarfile.open(filename, "r:gz")
print(tar.getnames())
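
A likely reason the earlier attempt printed nothing is the mode string: "w:gz" opens the archive for writing and starts a new, empty archive at that path, whereas "r:gz" opens it for reading. The sketch below builds a small throwaway archive and then lists the member names and file contents. The helper list_tgz and the sample paths are hypothetical. It checks isfile() before reading, since extractfile() returns None for directories:

```python
import io
import os
import tarfile
import tempfile


def list_tgz(path):
    """Return (name, content) pairs; content is None for non-file members."""
    entries = []
    with tarfile.open(path, "r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile():
                entries.append((member.name, tar.extractfile(member).read()))
            else:
                # Directories and other special members have no data to read.
                entries.append((member.name, None))
    return entries


# Build a small sample archive (hypothetical data) with one folder and one file.
tmp_dir = tempfile.mkdtemp()
archive_path = os.path.join(tmp_dir, "sample.tgz")
with tarfile.open(archive_path, "w:gz") as tar:
    folder = tarfile.TarInfo("docs")
    folder.type = tarfile.DIRTYPE
    tar.addfile(folder)

    data = b"hello"
    info = tarfile.TarInfo("docs/readme.txt")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

for name, content in list_tgz(archive_path):
    print(name, content)
```

Using tarfile.open as a context manager also closes the archive once the listing is done.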
