
How to check empty gzip file in Python

I don't want to use OS commands, as that would make it OS dependent.

This is available in tarfile: tarfile.is_tarfile(filename) checks whether a file is a tar file or not.

I am not able to find any relevant commands in the gzip module.


EDIT: Why do I need this: I have a list of gzip files that vary in size (1-10 GB), and some are empty. Before reading a file (using pandas.read_csv), I want to check whether it is empty, because for empty files I get an error from pandas.read_csv (something like: Expected 15 columns and found -1).

Sample command with error:

import pandas as pd
pd.read_csv('C:\Users\...\File.txt.gz', compression='gzip', names={'a', 'b', 'c'}, header=False)
Too many columns specified: expected 3 and found -1

pandas version is 0.16.2

The file used for testing is just a gzip of an empty file.

Unfortunately, the gzip module does not expose any functionality equivalent to the -l list option of the gzip program. But in Python 3 you can easily get the size of the uncompressed data by calling the .seek method with a whence argument of 2, which signifies positioning relative to the end of the (uncompressed) data stream.

.seek returns the new byte position, so .seek(0, 2) returns the byte offset of the end of the uncompressed file, i.e., the file size. Thus, if the uncompressed file is empty, the .seek call will return 0.

import gzip

def gz_size(fname):
    with gzip.open(fname, 'rb') as f:
        return f.seek(0, whence=2)

Here's a function that will work on Python 2, tested on Python 2.6.6.

def gz_size(fname):
    f = gzip.open(fname, 'rb')
    data = f.read()
    f.close()
    return len(data)

You can read about .seek and other methods of the GzipFile class using the pydoc program. Just run pydoc gzip in the shell.


Alternatively, if you wish to avoid decompressing the file you can (sort of) read the uncompressed data size directly from the .gz file. The size is stored in the last 4 bytes of the file as a little-endian unsigned long, so it's actually the size modulo 2**32, therefore it will not be the true size if the uncompressed data size is >= 4GB.

This code works on both Python 2 and Python 3.

import gzip
import struct

def gz_size(fname):
    with open(fname, 'rb') as f:
        f.seek(-4, 2)
        data = f.read(4)
    size = struct.unpack('<L', data)[0]
    return size

However, this method is not reliable, as Mark Adler (gzip co-author) mentions in the comments:

There are other reasons that the length at the end of the gzip file would not represent the length of the uncompressed data. (Concatenated gzip streams, padding at the end of the gzip file.) It should not be used for this purpose. It's only there as an integrity check on the data.
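That caveat is easy to demonstrate in a few lines: concatenate two gzip members in memory and compare the trailing ISIZE field against the real decompressed length (a sketch using only the standard library):

```python
import gzip
import struct

# Two concatenated gzip members form one valid gzip file, but the
# trailing ISIZE field describes only the *last* member.
data = gzip.compress(b'x' * 1000) + gzip.compress(b'')

# Footer-based "size": last 4 bytes, little-endian unsigned long
footer_size = struct.unpack('<L', data[-4:])[0]   # 0, from the empty last member

# Real size: gzip.decompress reads through all members
actual_size = len(gzip.decompress(data))          # 1000
```

So a footer-based check would report this 1000-byte payload as empty.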


Here is another solution. It does not decompress the whole file. It returns True if the uncompressed data in the input file is of zero length, but it also returns True if the input file itself is of zero length. If the input file is not of zero length and is not a gzip file then OSError is raised.

import gzip

def gz_is_empty(fname):
    ''' Test if gzip file fname is empty
        Return True if the uncompressed data in fname has zero length
        or if fname itself has zero length
        Raises OSError if fname has non-zero length and is not a gzip file
    '''
    with gzip.open(fname, 'rb') as f:
        data = f.read(1)
    return len(data) == 0

If you want to check whether a file is a valid Gzip file, you can open it and read one byte from it. If it succeeds, the file is quite probably a gzip file, with one caveat: an empty file also succeeds this test.

Thus we get

import gzip

def is_gz_file(name):
    with gzip.open(name, 'rb') as f:
        try:
            f.read(1)
            return True
        except (IOError, OSError):  # invalid gzip data raises IOError on Py2, OSError on Py3
            return False

However, as I stated earlier, a file which is empty (0 bytes), still succeeds this test, so you'd perhaps want to ensure that the file is not empty:

import os
import gzip

def is_gz_file(name):
    if os.stat(name).st_size == 0:
        return False

    with gzip.open(name, 'rb') as f:
        try:
            f.read(1)
            return True
        except (IOError, OSError):
            return False

EDIT:

since the question has now changed to "a gzip file that doesn't have empty contents":

import gzip

def is_nonempty_gz_file(name):
    with gzip.open(name, 'rb') as f:
        try:
            file_content = f.read(1)
            return len(file_content) > 0
        except (IOError, OSError):
            return False

UPDATE:

I would strongly recommend upgrading to pandas 0.18.1 (the latest version at the time of writing), as each new version of pandas introduces nice new features and fixes tons of old bugs. The current version (0.18.1) will process your empty files out of the box (see the demo below).

If you can't upgrade to a newer version, then follow @MartijnPieters' recommendation: catch the exception instead of checking up front (the "Easier to ask forgiveness than permission" paradigm).

OLD answer: a small demonstration (using pandas 0.18.1) that tolerates empty files, different numbers of columns, etc.

I tried to reproduce your error (an empty CSV.gz, different numbers of columns, etc.), but I didn't manage to reproduce your exception using pandas v0.18.1:

import os
import glob
import gzip
import pandas as pd

fmask = 'd:/temp/.data/37874936/*.csv.gz'

files = glob.glob(fmask)

cols = ['a','b','c']

for f in files:
    # actually there is no need to use `compression='gzip'` - pandas will guess it itself
    # i left it in order to be sure that we are using the same parameters ...
    df = pd.read_csv(f, header=None, names=cols, compression='gzip', sep=',')
    print('\nFILE: [{:^40}]'.format(f))
    print('{:-^60}'.format(' ORIGINAL contents '))
    print(gzip.open(f, 'rt').read())
    print('{:-^60}'.format(' parsed DF '))
    print(df) 

Output:

FILE: [    d:/temp/.data/37874936\1.csv.gz     ]
-------------------- ORIGINAL contents ---------------------
11,12,13
14,15,16


------------------------ parsed DF -------------------------
    a   b   c
0  11  12  13
1  14  15  16

FILE: [  d:/temp/.data/37874936\empty.csv.gz   ]
-------------------- ORIGINAL contents ---------------------

------------------------ parsed DF -------------------------
Empty DataFrame
Columns: [a, b, c]
Index: []

FILE: [d:/temp/.data/37874936\zz_5_columns.csv.gz]
-------------------- ORIGINAL contents ---------------------
1,2,3,4,5
11,22,33,44,55

------------------------ parsed DF -------------------------
        a   b   c
1  2    3   4   5
11 22  33  44  55

FILE: [d:/temp/.data/37874936\z_bad_CSV.csv.gz ]
-------------------- ORIGINAL contents ---------------------
1
5,6,7
1,2
8,9,10,5,6

------------------------ parsed DF -------------------------
   a    b     c
0  1  NaN   NaN
1  5  6.0   7.0
2  1  2.0   NaN
3  8  9.0  10.0

FILE: [d:/temp/.data/37874936\z_single_column.csv.gz]
-------------------- ORIGINAL contents ---------------------
1
2
3

------------------------ parsed DF -------------------------
   a   b   c
0  1 NaN NaN
1  2 NaN NaN
2  3 NaN NaN

Can you post a sample CSV, causing this error or upload it somewhere and post here a link?

Unfortunately, any such attempt will have a fair bit of overhead; it would likely be cheaper to catch the exception, as users commented above. A gzip file defines a few fixed-size regions, as follows:

Fixed Regions

First, there are 2 bytes for the gzip magic number, 1 byte for the compression method, 1 byte for the flags, then 4 more bytes for the MTIME (file modification time), 1 byte for extra flags, and one more byte for the operating system, giving us a total of 10 bytes so far.

This looks as follows (from the link above):

+---+---+---+---+---+---+---+---+---+---+
|ID1|ID2|CM |FLG|     MTIME     |XFL|OS | (more-->)
+---+---+---+---+---+---+---+---+---+---+
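Checking just this fixed region is cheap: read the first 10 bytes and match them against the layout above (has_gzip_header is a hypothetical helper; it validates only the magic number and compression method, so it is a plausibility check, not proof of a valid file):

```python
import struct

def has_gzip_header(path):
    """Check only the 10-byte fixed gzip header sketched above."""
    with open(path, 'rb') as f:
        header = f.read(10)
    if len(header) < 10:
        return False
    # '<2sBBIBB' mirrors the diagram: ID1/ID2, CM, FLG, MTIME, XFL, OS
    magic, cm, _flg, _mtime, _xfl, _os = struct.unpack('<2sBBIBB', header)
    # ID1=0x1f, ID2=0x8b; CM=8 (deflate) is the only defined method
    return magic == b'\x1f\x8b' and cm == 8
```
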

Variable Regions

However, this is where things get tricky (and impossible to check without using the gzip module or another decompressor).

If extra fields were set, there is a variable region of XLEN bytes set afterwards, which looks as follows:

(if FLG.FEXTRA set)
+---+---+=================================+
| XLEN  |...XLEN bytes of "extra field"...| (more-->)
+---+---+=================================+

After this, there is then a region of N bytes, with a zero-terminated string for the file name (which is, by default, stored):

(if FLG.FNAME set)
+=========================================+
|...original file name, zero-terminated...| (more-->)
+=========================================+

We then have comments:

(if FLG.FCOMMENT set)
+===================================+
|...file comment, zero-terminated...| (more-->)
+===================================+

And finally, an optional CRC16 of the header (a cyclic redundancy check, present if FLG.FHCRC is set), all before we get to the variable-length compressed data.
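To make the variable regions concrete, here is a sketch that walks the optional fields to find where the compressed data actually starts (deflate_data_offset and the written-out flag constants are hand-rolled for illustration; they are not part of the gzip module's public API):

```python
import struct

# Flag bits from the gzip spec (RFC 1952)
FTEXT, FHCRC, FEXTRA, FNAME, FCOMMENT = 1, 2, 4, 8, 16

def deflate_data_offset(path):
    """Return the byte offset where the deflate stream starts (a sketch)."""
    with open(path, 'rb') as f:
        header = f.read(10)
        if len(header) < 10 or header[:2] != b'\x1f\x8b':
            raise ValueError('not a gzip file')
        flg = header[3]
        if flg & FEXTRA:
            # XLEN, then XLEN bytes of "extra field"
            xlen = struct.unpack('<H', f.read(2))[0]
            f.read(xlen)
        if flg & FNAME:
            # zero-terminated original file name
            while f.read(1) not in (b'\x00', b''):
                pass
        if flg & FCOMMENT:
            # zero-terminated file comment
            while f.read(1) not in (b'\x00', b''):
                pass
        if flg & FHCRC:
            f.read(2)  # CRC16 of the header
        return f.tell()
```

As the offset varies file by file, this only underlines the point: there is no fixed position to check.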

Solution

So, any fixed-size check depends on whether a file name is stored (it is not when the data was written via a pipe: gzip -c "Compress this data" > myfile.gz), on extra fields, and on comments, all of which can be present for null files. So, how do we get around this? Simple: use the gzip module:

import gzip

def check_null(path):
    '''
    Returns empty bytes for a null file (which is falsy),
    and non-empty bytes otherwise (which is truthy)
    '''

    with gzip.GzipFile(path, 'rb') as f:
        return f.read(1)

This checks whether any data exists inside the file while reading only a small section of it. However, even that takes a while; it's easier to ask forgiveness than permission:

import contextlib       # contextlib.suppress is Python 3 only; use a try/except block on Py2
import pandas as pd

with contextlib.suppress(pd.parser.CParserError):
    df = pd.read_csv(path, compression='gzip', names=['a', 'b', 'c'], header=None)
    # do something here

Try something like this:

def is_empty(gzfile):
    # note: this decompresses the entire file to measure it
    data = gzfile.read()
    if len(data) > 0:
        gzfile.rewind()
        return False
    else:
        return True
import gzip

with gzip.open("pCSV.csv.gz", 'r') as f:

    f.seek(3)
    counterA = f.tell()

    f.seek(2, 0)
    counterB = f.tell()

    if counterA > counterB:
        print("NOT EMPTY")
    else:
        print("EMPTY")

This should do it without reading the whole file (though note that seeking still decompresses data up to the target offset, and files whose uncompressed size is 1 or 2 bytes are misreported as empty).

Looking through the source code for the Python 2.7 version of the gzip module, it seems to immediately return EOF, not only in the case where the gzipped file is zero bytes, but also in the case that the gzip file is zero bytes, which is arguably a bug.

However, for your particular use-case, we can do a little better, by also confirming the gzipped file is a valid CSV file.

This code...

import csv
import gzip

# Returns true if the specified filename is a valid gzip'd CSV file
# If the optional 'columns' parameter is specified, also check that
# the first row has that many columns
def is_valid(filename, columns=None):

    try:

        # Chain a CSV reader onto a gzip reader
        csv_file = csv.reader(gzip.open(filename))

        # This will try to read the first line
        # If it's not a valid gzip, this will raise IOError
        for row in csv_file:

            # We got at least one row
            # Bail out here if we don't care how many columns we have
            if columns is None:
                return True

            # Check it has the right number of columns
            return len(row) == columns

        else:

            # There were no rows
            return False

    except IOError:

        # This is not a valid gzip file
        return False


# Example to check whether File.txt.gz is valid
result = is_valid('File.txt.gz')

# Example to check whether File.txt.gz is valid, and has three columns
result = is_valid('File.txt.gz', columns=3)

...should correctly handle the following error cases...

  1. The gzip file is zero bytes
  2. The gzip file is not a valid gzip file
  3. The gzipped file is zero bytes
  4. The gzipped file is not zero bytes, but contains no CSV data
  5. (Optionally) The gzipped file contains CSV data, but with the wrong number of columns

I had a few hundred thousand gzip files, only a few of which were zero-sized, mounted on a network share. I was forced to use the following optimization. It is brittle, but in the (very common) case where you have a large number of files generated by the same method, the total size of everything other than the stored payload name is a constant.

Then you can check for a zero-sized payload by:

  1. Computing that constant over one file. You can code it up, but I find it simpler to just use command-line gzip (and this whole answer is an ugly hack anyway).
  2. Examining only the inode for the rest of the files, instead of opening each file, which can be orders of magnitude faster:
import os
from os.path import basename

# YMMV with len_minus_file_name
def is_gzip_empty(file_name, len_minus_file_name=23):
    return os.stat(file_name).st_size - len(basename(file_name)) == len_minus_file_name

This could break in many ways. Caveat emptor. Only use it if other methods are not practical.
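If you would rather not derive the constant with command-line gzip, it can be measured from an in-memory empty member. This sketch assumes your files store an original file name (FNAME), as both the gzip CLI and Python's GzipFile do when given a filename; empty_member_overhead is a made-up helper:

```python
import gzip
import io

def empty_member_overhead(stored_name):
    """Sketch: size of an empty gzip member minus its stored file name.

    Writes an empty member in memory with FNAME set (a filename is
    given), mtime pinned to 0 so the result is reproducible.
    """
    buf = io.BytesIO()
    with gzip.GzipFile(stored_name, 'wb', fileobj=buf, mtime=0):
        pass  # write no payload at all
    return len(buf.getvalue()) - len(stored_name)
```

The value is the same regardless of the stored name's length, which is exactly the "constant" this trick relies on; files produced by a different tool or settings may yield a different constant, so measure against your own data.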
