How to tell if a file is gzip compressed?

I have a Python program which is going to take text files as input. However, some of these files may be gzip compressed.

Is there a cross-platform, usable from Python way to determine if a file is gzip compressed or not?

Is the following reliable or could an ordinary text file 'accidentally' look gzip-like enough for me to get false positives?

import gzip

try:
    gzip.GzipFile(filename, 'r')
    # compressed
    ...
except:
    # not compressed
    ...

The magic number for gzip compressed files is 1f 8b. Although testing for this is not 100% reliable, it is highly unlikely that "ordinary text files" start with those two bytes; in UTF-8 the sequence is not even legal, because 8b is a continuation byte and cannot follow the single-byte character 1f.
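As a quick check of the UTF-8 claim, here is a minimal sketch. The helper name is_valid_utf8 is made up for illustration. A legacy single-byte encoding such as Latin-1 would accept both bytes, which is one reason the test is not 100% reliable.

```python
# The gzip magic number as raw bytes.
GZIP_MAGIC = b'\x1f\x8b'

def is_valid_utf8(data):
    """Return True if `data` decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

print(is_valid_utf8(GZIP_MAGIC))     # False
print(is_valid_utf8(b'plain text'))  # True
```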

Usually gzip compressed files sport the suffix .gz though. Even gzip(1) itself won't unpack files without it unless you --force it to. You could conceivably use that, but you'd still have to deal with a possible IOError (which you have to in any case).

One problem with your approach is that gzip.GzipFile() will not throw an exception if you feed it an uncompressed file. Only a later read() will. This means that you would probably have to implement some of your program logic twice. Ugly.
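The lazy behaviour can be seen with a small experiment. This is only a sketch: it writes a throwaway plain-text file (the temporary file and the two flag names are illustrative), opens it with gzip.GzipFile(), and records at which step the error appears.

```python
import gzip
import os
import tempfile

# Write an ordinary, uncompressed text file to a temporary location.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('just plain text\n')
    plain_path = tmp.name

opened_ok = False
read_failed = False
try:
    f = gzip.GzipFile(plain_path, 'r')  # constructing the object does not raise
    opened_ok = True
    try:
        f.read(1)                       # the gzip header is only checked here
    except OSError:
        read_failed = True
    finally:
        f.close()
finally:
    os.remove(plain_path)

print(opened_ok, read_failed)  # True True
```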

Is there a cross-platform, usable from Python way to determine if a file is gzip compressed or not?

The accepted answer explains how one can detect a gzip compressed file in general: test if the first two bytes are 1f 8b. However, it does not show how to implement it in Python.

Here is one way:

def is_gz_file(filepath):
    with open(filepath, 'rb') as test_f:
        return test_f.read(2) == b'\x1f\x8b'
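Building on that check, one possible way to handle the original use case (text input that may or may not be compressed) is a small wrapper that picks the right opener. The name open_text and the default UTF-8 encoding are assumptions made for this sketch, not part of the answer above.

```python
import gzip

GZIP_MAGIC = b'\x1f\x8b'

def is_gz_file(filepath):
    # Same magic-number test as above.
    with open(filepath, 'rb') as test_f:
        return test_f.read(2) == GZIP_MAGIC

def open_text(filepath, encoding='utf-8'):
    """Open `filepath` for reading text, decompressing if it looks gzipped.

    Hypothetical helper: the encoding default is an assumption.
    """
    if is_gz_file(filepath):
        return gzip.open(filepath, 'rt', encoding=encoding)
    return open(filepath, 'r', encoding=encoding)
```

Used in a with statement, open_text(path) yields decoded lines in either case, so the rest of the program logic only has to be written once.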

Testing the magic number of a gzip file is the only reliable way to go. However, as of Python 3.7 there is no need to compare bytes yourself anymore. The gzip module will compare the bytes for you and raise an exception if they do not match!

As of Python 3.7, this works:

import gzip
with gzip.open(input_file, 'r') as fh:
    try:
        fh.read(1)
    except OSError:
        print('input_file is not a valid gzip file by OSError')

As of Python 3.8, this also works:

import gzip
with gzip.open(input_file, 'r') as fh:
    try:
        fh.read(1)
    except gzip.BadGzipFile:
        print('input_file is not a valid gzip file by BadGzipFile')
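Because gzip.BadGzipFile inherits from OSError, the two variants can be folded into one helper that returns a boolean. The following is a sketch only; the function name is_readable_gzip is invented here. The Python documentation notes that EOFError and zlib.error can also be raised for invalid gzip files, so those are caught as well. Note that a missing or unreadable file also ends up as False, since those errors are OSError subclasses too.

```python
import gzip
import zlib

def is_readable_gzip(filepath):
    """Return True if reading from `filepath` as gzip raises no error.

    Hypothetical helper; only the start of the stream is exercised.
    """
    try:
        with gzip.open(filepath, 'rb') as fh:
            fh.read(1)
        return True
    except (OSError, EOFError, zlib.error):
        # OSError covers gzip.BadGzipFile (Python 3.8+) as well as
        # missing or unreadable files.
        return False
```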

gzip itself will raise an OSError if it's not a gzipped file.

>>> with gzip.open('README.md', 'rb') as f:
...     f.read()
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/Users/dennis/.asdf/installs/python/3.6.6/lib/python3.6/gzip.py", line 276, in read
    return self._buffer.read(size)
  File "/Users/dennis/.asdf/installs/python/3.6.6/lib/python3.6/gzip.py", line 463, in read
    if not self._read_gzip_header():
  File "/Users/dennis/.asdf/installs/python/3.6.6/lib/python3.6/gzip.py", line 411, in _read_gzip_header
    raise OSError('Not a gzipped file (%r)' % magic)
OSError: Not a gzipped file (b'# ')

You can combine this approach with some others to increase confidence, such as checking the mimetype, looking for a magic number in the file header (see other answers for an example), and checking the extension.

import pathlib

if '.gz' in pathlib.Path(filepath).suffixes:
    # some more inexpensive checks until confident we can attempt to decompress
    # ...
    try:
        ...
    except OSError as e:
        ...

Import the mimetypes module. It can guess from the file name what kind of file you have, and whether it is compressed.

For example:

import mimetypes
mimetypes.guess_type('blabla.txt.gz')

returns:

('text/plain', 'gzip')

Doesn't seem to work well in Python 3...

import mimetypes

filename = "./datasets/test"

def file_type(filename):
    # Avoid naming the result "type", which would shadow the built-in.
    return mimetypes.guess_type(filename)

print(file_type(filename))

returns (None, None). But the Unix command "file" reports:

:~> file datasets/test
datasets/test: gzip compressed data, was "iostat_collection", from Unix, last modified: Thu Jan 29 07:09:34 2015
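The (None, None) result is consistent with how mimetypes works: guess_type() maps the extension in the name it is given to a type and never opens the file, whereas file(1) inspects the contents. A gzip file without a .gz suffix is therefore invisible to it. A small sketch (neither path needs to exist, and the variable names are illustrative):

```python
import mimetypes

# guess_type() looks only at the name; no file is opened or read.
with_suffix = mimetypes.guess_type('blabla.txt.gz')
without_suffix = mimetypes.guess_type('datasets/test')

print(with_suffix)     # ('text/plain', 'gzip')
print(without_suffix)  # (None, None)
```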
