How to get the name of the original file from a .gz archive?
I'm writing a utility that takes a .gz archive and checks whether its contents already exist in a specified folder. If they don't, it will extract the archive there.
My plan was to read the names of the files in the .gz archive one by one and check whether each file already exists in my directory. But from what I understand, this isn't possible with gzip.
Ideally, I'm looking for something like this:
archive = gzipfile.GzipFile(source)
for i in archive.getmembers():
    if os.path.isfile(destination + sep + i.name) and overwrite:
        ...
Is this possible?
While it is true that a .gz file is simply one compressed file, its name does not have to match the original: the original file name may have been truncated, or the .gz file may have been renamed.
gunzip can be told to restore the original file name with the -N flag, and if -N is combined with -l (minus lowercase L), it reports the original file name without uncompressing the file.
For example:
$ gzip sometext.txt
$ mv sometext.txt.gz othertext.gz
$ gunzip -Nl othertext.gz
compressed uncompressed ratio uncompressed_name
58 113 76.1% sometext.txt
You can hack your way through this in Python as well:
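from subprocess import check_output

# -N: use the stored name, -l: list only, -q: suppress the header line
listing = check_output(['gunzip', '-Nlq', 'othertext.gz']).decode()
# Everything after the "%" of the ratio column is the original file name
size_name = listing.strip().split("%", 1)
print("original filename =", size_name[1].strip())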
Result:
original filename = sometext.txt
I do not believe that Python's gzip package lets you access the original file name. Someone else may know differently!
Adding to the accepted answer:
At least CPython's gzip module does not expose the file name metadata: it simply discards it, as you can see when you check the source code.
However, the gzip file format (specified in RFC 1952), or at least its metadata, is easy enough to parse manually:
import struct

def getGzipName(path):
    with open(path, 'rb') as file:
        header = file.read(10)
        if len(header) < 10:
            return None  # too short to hold a gzip header
        id1, id2, compression, flags, mtime, extraFlags, osId = struct.unpack('<BBBBLBB', header)
        if id1 != 0x1F or id2 != 0x8B or compression != 0x08:
            return None
        # Extra Field (e.g. used by bgzip to store the length of the compressed block):
        # a 2-byte little-endian length ('<H') followed by that many bytes
        if flags & ( 1 << 2 ) != 0:
            file.read(struct.unpack('<H', file.read(2))[0])
        # File Name Field (zero-terminated)
        if flags & ( 1 << 3 ) != 0:
            name = b''
            c = file.read(1)
            while c and c != b'\0':  # also stop at end of file
                name += c
                c = file.read(1)
            return name.decode()
    return None
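To tie this back to the question, here is a minimal, self-contained sketch of the same header parsing applied to a renamed archive. The helper name `read_gzip_original_name` and the file names are made up for illustration. The sketch decodes the name as Latin-1 because RFC 1952 specifies ISO 8859-1, although real files may store other encodings. It relies on `gzip.GzipFile(filename=...)` writing the given name into the header.

```python
import gzip
import os
import struct
import tempfile


def read_gzip_original_name(path):
    """Return the FNAME header field of a gzip file, or None if absent."""
    with open(path, "rb") as f:
        header = f.read(10)
        if len(header) < 10:
            return None
        id1, id2, method, flags, _mtime, _xfl, _os = struct.unpack("<BBBBLBB", header)
        if (id1, id2, method) != (0x1F, 0x8B, 0x08):
            return None
        if flags & 0x04:  # FEXTRA: 2-byte little-endian length, then payload
            (xlen,) = struct.unpack("<H", f.read(2))
            f.read(xlen)
        if not flags & 0x08:  # FNAME flag not set
            return None
        name = bytearray()
        c = f.read(1)
        while c and c != b"\0":  # stop at the terminator or at end of file
            name += c
            c = f.read(1)
        return name.decode("latin-1")  # assumption: name follows RFC 1952


with tempfile.TemporaryDirectory() as tmp:
    # The archive is called othertext.gz, but its header stores sometext.txt
    renamed = os.path.join(tmp, "othertext.gz")
    with open(renamed, "wb") as raw:
        with gzip.GzipFile(filename="sometext.txt", mode="wb", fileobj=raw) as gz:
            gz.write(b"hello\n")
    original_name = read_gzip_original_name(renamed)
    # The check the question asks for: is the original file already present?
    already_there = os.path.isfile(os.path.join(tmp, original_name))
    print(original_name, already_there)
```

If `already_there` is false, the utility can go ahead and decompress the archive into the destination under `original_name`.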
Note that, in theory, gzip could be used as an archive format: it supports storing original file names, which could be used to store paths, and multiple gzip streams (each with a different file name) may be concatenated. However, not even the gzip tool supports such exotic gzip files, not even with the --name option. It simply appends the data of the second gzip stream to the file named after the original file name of the first gzip stream.
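As an illustration of the concatenation point, the following sketch walks through a multi-member gzip blob and reads the stored name of each member. It is only one possible approach, not something the gzip tool offers: the helper names `iter_member_names` and `make_member` are invented here, and the end of each member is found by letting `zlib` decompress it and inspecting `unused_data`.

```python
import gzip
import io
import struct
import zlib


def iter_member_names(data):
    """Yield the stored name of each gzip member in `data` (None if unnamed)."""
    pos = 0
    while pos < len(data):
        id1, id2, method, flags = struct.unpack_from("<BBBB", data, pos)
        if (id1, id2, method) != (0x1F, 0x8B, 0x08):
            raise ValueError("no gzip member at offset %d" % pos)
        cursor = pos + 10  # skip the fixed-size part of the header
        if flags & 0x04:  # FEXTRA: 2-byte length, then payload
            (xlen,) = struct.unpack_from("<H", data, cursor)
            cursor += 2 + xlen
        name = None
        if flags & 0x08:  # FNAME: zero-terminated string
            end = data.index(b"\0", cursor)
            name = data[cursor:end].decode("latin-1")
        yield name
        # Let zlib consume this whole member (header, data, trailer);
        # whatever is left in unused_data belongs to the next member.
        inflater = zlib.decompressobj(16 + zlib.MAX_WBITS)
        inflater.decompress(data[pos:])
        pos = len(data) - len(inflater.unused_data)


def make_member(name, payload):
    """Build one gzip stream in memory with `name` stored in its header."""
    buf = io.BytesIO()
    with gzip.GzipFile(filename=name, mode="wb", fileobj=buf) as gz:
        gz.write(payload)
    return buf.getvalue()


blob = make_member("first.txt", b"one\n") + make_member("second.txt", b"two\n")
names = list(iter_member_names(blob))
combined = gzip.decompress(blob)  # the payloads are simply joined together
print(names, combined)
```

The names are recoverable per member, but ordinary decompression merges the payloads into one stream, which is why this layout does not work as an archive in practice.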
import tarfile
archive = tarfile.open(source)
for i in archive.getmembers():
    ...
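The snippet above only applies if the file is really a gzip-compressed tar archive (for example a .tar.gz), which is an assumption about the asker's data. Under that assumption, a rough sketch of the check-then-extract loop could look like this. The function name `extract_missing` and the demo file names are hypothetical.

```python
import io
import os
import tarfile
import tempfile


def extract_missing(source, destination, overwrite=False):
    """Extract regular files from a tar archive, skipping ones already present.

    Returns the names of the members that were extracted.
    """
    extracted = []
    with tarfile.open(source) as archive:  # default mode detects gzip compression
        for member in archive.getmembers():
            if not member.isfile():
                continue  # ignore directories, links, devices
            target = os.path.join(destination, member.name)
            if os.path.isfile(target) and not overwrite:
                continue  # already present, leave it alone
            # Note: extract() trusts member paths; avoid untrusted archives
            archive.extract(member, destination)
            extracted.append(member.name)
    return extracted


with tempfile.TemporaryDirectory() as tmp:
    # Build a small demo archive with two files
    source = os.path.join(tmp, "demo.tar.gz")
    with tarfile.open(source, "w:gz") as tar:
        for name, payload in (("a.txt", b"A"), ("b.txt", b"B")):
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    # Pretend a.txt already exists in the destination folder
    destination = os.path.join(tmp, "out")
    os.makedirs(destination)
    with open(os.path.join(destination, "a.txt"), "wb") as existing:
        existing.write(b"already here")
    first_run = extract_missing(source, destination)
    second_run = extract_missing(source, destination)
    print(first_run, second_run)
```

On the first run only the missing member is extracted; on the second run nothing is, because both files now exist in the destination.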