How to get the name of the original file from a .gz archive?

I'm writing a utility that takes a .gz archive and checks whether its contents already exist in a specified folder. If they don't, it will extract the archive there.

My plan was to read the names of the files in the .gz archive one by one and check whether each file already exists in my directory. But from what I understand, this isn't possible with gzip.

Ideally, I'm looking for something like this:

archive = gzipfile.GzipFile(source)

for i in archive.getmembers():
    if os.path.isfile(destination + sep + i.name) and overwrite:
        ...

Is this possible?

A .gz file is not an archive; it is simply a compressed file. If you have a .tar.gz file, you can use tarfile.

While it is true that a .gz file is simply a compressed file, the original file name can be truncated, and the compressed .gz file itself can be renamed. gunzip can be told to use the original file name with the -N flag; combined with -l (a lowercase L), it reports the original file name without decompressing the file.
For example:

$ gzip sometext.txt
$ mv sometext.txt.gz othertext.gz
$ gunzip -Nl othertext.gz
         compressed        uncompressed  ratio uncompressed_name
                 58                 113  76.1% sometext.txt

You can hack your way through this in Python as well:

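from subprocess import check_output
# -N: report the stored name, -l: list only, -q: suppress the header line
# text=True makes check_output return str instead of bytes (Python 3.7+)
output = check_output(['gunzip', '-Nlq', 'othertext.gz'], text=True)
# The name is the last column, right after the "ratio%" field
size_name = output.strip().split("%", 1)
print("original filename =", size_name[1].strip())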

Result:

original filename = sometext.txt

I do not believe that the Python gzip package allows you to access the original file name.
Someone else may know differently!

Adding to the accepted answer:

At least CPython's gzip module does not expose the file name metadata: it simply discards it, as you can see in the source code.

However, the gzip file format (specified in RFC 1952), or at least its metadata, is easy enough to parse manually:

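import struct

def getGzipName(path):
    with open(path, 'rb') as file:
        header = file.read(10)
        if len(header) < 10:
            return None  # too short to contain a gzip header
        id1, id2, compression, flags, mtime, extraFlags, osId = struct.unpack('<BBBBLBB', header)
        if id1 != 0x1F or id2 != 0x8B or compression != 0x08:
            return None

        # Extra Field (e.g. used by bgzip to store the length of the compressed block)
        if flags & ( 1 << 2 ) != 0:
            # XLEN is an unsigned 16-bit little-endian integer ('<H')
            file.read(struct.unpack('<H', file.read(2))[0])

        # File Name Field (zero-terminated)
        if flags & ( 1 << 3 ) != 0:
            name = b''
            c = file.read(1)
            # Also stop at end of file (c == b'') so a truncated header cannot loop forever
            while c and c != b'\0':
                name += c
                c = file.read(1)
            # RFC 1952 specifies ISO 8859-1 for this field; decode() assumes UTF-8 instead
            return name.decode()

    return None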
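
To tie this back to the question, here is a minimal sketch of how the stored name could be used to decide whether to decompress at all. The helper names `stored_name` and `extract_if_missing` are hypothetical, not part of any library. The sketch assumes a single-member .gz file, decodes the name as ISO 8859-1 as RFC 1952 specifies, keeps only the base name of whatever was stored, and falls back to the archive's own name without `.gz` when no name field is present.

```python
import gzip
import os
import shutil
import struct


def stored_name(path):
    """Return the FNAME field of the first gzip member, or None."""
    with open(path, 'rb') as f:
        header = f.read(10)
        if len(header) < 10 or header[:3] != b'\x1f\x8b\x08':
            return None
        flags = header[3]
        if flags & 0x04:  # FEXTRA: 2-byte little-endian length, then payload
            raw_len = f.read(2)
            if len(raw_len) < 2:
                return None
            f.read(struct.unpack('<H', raw_len)[0])
        if not flags & 0x08:  # FNAME flag not set
            return None
        name = bytearray()
        for c in iter(lambda: f.read(1), b''):
            if c == b'\0':
                return name.decode('latin-1')
            name += c
        return None  # file ended before the terminating zero byte


def extract_if_missing(gz_path, destination, overwrite=False):
    """Decompress gz_path into destination unless the target already exists.

    Returns (target_path, written), where written says whether data was written.
    """
    # Keep only the base name: a stored path should not be trusted.
    name = os.path.basename(stored_name(gz_path) or '')
    if not name:
        base = os.path.basename(gz_path)
        name = base[:-3] if base.endswith('.gz') else base + '.out'
    target = os.path.join(destination, name)
    if os.path.isfile(target) and not overwrite:
        return target, False
    with gzip.open(gz_path, 'rb') as src, open(target, 'wb') as dst:
        shutil.copyfileobj(src, dst)
    return target, True
```

Called as `extract_if_missing('othertext.gz', destination)`, this would write `sometext.txt` into `destination` on the first run and skip the work on later runs unless `overwrite=True` is passed.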

Note that, in theory, gzip could be used as an archive format: it supports storing the original file name, which could hold a path, and multiple gzip streams (each with a different file name) may be concatenated. However, not even the gzip tool supports such exotic gzip files, even with the --name option. It simply appends the decompressed data of the second gzip stream to the file named after the first stream's original file name.
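
The concatenation behaviour can be reproduced with Python's own gzip module, which also treats concatenated members as one continuous stream. This is a small sketch: `make_member` is a hypothetical helper, and the file names and payloads are made up for illustration.

```python
import gzip
import io


def make_member(name, payload):
    """Compress payload into one gzip member whose FNAME field is `name`."""
    buf = io.BytesIO()
    with gzip.GzipFile(filename=name, mode='wb', fileobj=buf) as g:
        g.write(payload)
    return buf.getvalue()


first = make_member('first.txt', b'alpha\n')
second = make_member('second.txt', b'beta\n')
combined = first + second  # two gzip members back to back

# Both names are physically present in the uncompressed member headers...
assert b'first.txt\0' in combined and b'second.txt\0' in combined

# ...but decompression returns one blob, with no member boundaries or names.
data = gzip.decompress(combined)
print(data)  # b'alpha\nbeta\n'
```

Recovering the individual names and payloads would therefore require walking the member headers by hand, which is exactly the bookkeeping that tar does for you.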

import tarfile

archive = tarfile.open(source)
for i in archive.getmembers():
    ...
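
For a .tar.gz file, the loop above can be filled in along these lines. This is only a sketch: `extract_missing` is a hypothetical helper, it handles regular files only, and it assumes the archive is trusted. Member names from an untrusted archive can point outside `destination`, so they would need validating first.

```python
import os
import tarfile


def extract_missing(source, destination, overwrite=False):
    """Extract regular files from a tar archive unless they already exist.

    Returns the list of member names that were extracted.
    """
    extracted = []
    with tarfile.open(source) as archive:  # compression is detected automatically
        for member in archive.getmembers():
            if not member.isfile():
                continue  # skip directories, links and special files
            target = os.path.join(destination, member.name)
            if os.path.isfile(target) and not overwrite:
                continue  # already present, leave it alone
            archive.extract(member, destination)
            extracted.append(member.name)
    return extracted
```

Running it twice on the same archive and destination should extract everything the first time and nothing the second time, unless `overwrite=True` is passed.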

Note: technical posts on this site are licensed under CC BY-SA 4.0. If you repost, please credit this site's URL or the original address.
