How to tell if a file is gzip compressed?
I have a Python program which is going to take text files as input. However, some of these files may be gzip compressed.
Is there a cross-platform way, usable from Python, to determine whether a file is gzip compressed?
Is the following reliable, or could an ordinary text file 'accidentally' look gzip-like enough for me to get false positives?
try:
    gzip.GzipFile(filename, 'r')
    # compressed
    # ...
except:
    # not compressed
    # ...
The magic number for gzip compressed files is 1f 8b. Although testing for this is not 100% reliable, it is highly unlikely that "ordinary text files" start with those two bytes; in UTF-8 it's not even legal.
Usually gzip compressed files sport the suffix .gz, though. Even gzip(1) itself won't unpack files without it unless you --force it to. You could conceivably use that, but you'd still have to deal with a possible IOError (which you have to do in any case).
One problem with your approach is that gzip.GzipFile() will not throw an exception if you feed it an uncompressed file. Only a later read() will. This means that you would probably have to implement some of your program logic twice. Ugly.
Is there a cross-platform, usable from Python way to determine if a file is gzip compressed or not?
The accepted answer explains how one can detect a gzip compressed file in general: test whether the first two bytes are 1f 8b. However, it does not show how to implement that in Python.
Here is one way:
def is_gz_file(filepath):
    with open(filepath, 'rb') as test_f:
        return test_f.read(2) == b'\x1f\x8b'
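Since the original goal was to read text files that may or may not be compressed, the magic-number check could be combined with an opener along these lines. This is only a sketch: `open_text` is a hypothetical helper name, and UTF-8 is an assumed default encoding, not something the question specifies.

```python
import gzip


def is_gz_file(filepath):
    # Same check as in the answer: compare the first two bytes to 1f 8b.
    with open(filepath, 'rb') as test_f:
        return test_f.read(2) == b'\x1f\x8b'


def open_text(filepath, encoding='utf-8'):
    """Open `filepath` for reading text, decompressing if it looks gzipped.

    Hypothetical helper; the UTF-8 default is an assumption.
    """
    if is_gz_file(filepath):
        # 'rt' makes gzip.open return a text-mode file object
        return gzip.open(filepath, 'rt', encoding=encoding)
    return open(filepath, 'r', encoding=encoding)
```

Both branches return a text-mode file object, so the calling code can iterate over lines the same way in either case, for example `with open_text(path) as f: for line in f: ...`.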
Testing the magic number of a gzip file is the only reliable way to go. However, as of Python 3.7 there is no need to mess with comparing bytes yourself anymore. The gzip module will compare the bytes for you and raise an exception if they do not match!
As of Python 3.7, this works:
import gzip
with gzip.open(input_file, 'r') as fh:
    try:
        fh.read(1)
    except OSError:
        print('input_file is not a valid gzip file by OSError')
As of Python 3.8, this also works:
import gzip
with gzip.open(input_file, 'r') as fh:
    try:
        fh.read(1)
    except gzip.BadGzipFile:
        print('input_file is not a valid gzip file by BadGzipFile')
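The same test can be wrapped in a function that returns a boolean. The sketch below assumes Python 3.8 or later, and `reads_as_gzip` is a hypothetical name. gzip.BadGzipFile is a subclass of OSError, so catching only BadGzipFile lets unrelated errors such as a missing file propagate instead of being reported as "not gzip". A file that has the right magic number but a truncated or corrupt body may raise EOFError or zlib.error instead, which this sketch does not handle.

```python
import gzip


def reads_as_gzip(path):
    """Return True if the start of `path` can be read as gzip data.

    Hypothetical helper; requires Python 3.8+ for gzip.BadGzipFile.
    Other OSErrors (missing file, permissions) are deliberately not caught.
    """
    try:
        with gzip.open(path, 'rb') as fh:
            fh.read(1)          # forces the gzip header to be parsed
    except gzip.BadGzipFile:
        return False
    return True
```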
gzip itself will raise an OSError if it's not a gzipped file.
>>> with gzip.open('README.md', 'rb') as f:
...     f.read()
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/Users/dennis/.asdf/installs/python/3.6.6/lib/python3.6/gzip.py", line 276, in read
    return self._buffer.read(size)
  File "/Users/dennis/.asdf/installs/python/3.6.6/lib/python3.6/gzip.py", line 463, in read
    if not self._read_gzip_header():
  File "/Users/dennis/.asdf/installs/python/3.6.6/lib/python3.6/gzip.py", line 411, in _read_gzip_header
    raise OSError('Not a gzipped file (%r)' % magic)
OSError: Not a gzipped file (b'# ')
You can combine this approach with others to increase confidence, such as checking the mimetype, looking for the magic number in the file header (see the other answers for an example), and checking the extension.
import pathlib
if '.gz' in pathlib.Path(filepath).suffixes:
    # some more inexpensive checks until confident we can attempt to decompress
    # ...
    try:
        ...
    except OSError as e:
        ...
Doesn't seem to work well in Python 3...
import mimetypes
filename = "./datasets/test"
def file_type(filename):
    type = mimetypes.guess_type(filename)
    return type
print(file_type(filename))
This returns (None, None), but the Unix file command reports:
:~> file datasets/test
datasets/test: gzip compressed data, was "iostat_collection", from Unix, last modified: Thu Jan 29 07:09:34 2015
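A likely explanation is that mimetypes.guess_type() looks only at the file name and never opens the file, so a gzip file without a .gz suffix cannot be recognised this way. The sketch below illustrates that; `guessed_encoding` is a hypothetical helper and the second file name is made up.

```python
import mimetypes


def guessed_encoding(name):
    """Return the encoding part of mimetypes.guess_type(), e.g. 'gzip' or None.

    Hypothetical helper; the guess is based on the name only,
    so the file does not even need to exist.
    """
    _type, encoding = mimetypes.guess_type(name)
    return encoding


print(guessed_encoding('datasets/test'))         # no suffix, nothing to go on
print(guessed_encoding('datasets/test.txt.gz'))  # .gz suffix maps to 'gzip'
```

This makes mimetypes roughly equivalent to the extension check, while the file command inspects the content, as the magic-number approaches do.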