
numpy: fromfile for gzipped file

I am using numpy.fromfile to construct an array which I can pass to the pandas.DataFrame constructor:

import numpy as np
import pandas as pd

def read_best_file(file, **kwargs):
    '''
    Loads best price data into a dataframe
    '''
    names   = [ 'time', 'bid_size', 'bid_price', 'ask_size', 'ask_price' ]
    formats = [ 'u8',   'i4',       'f8',        'i4',       'f8'        ]
    offsets = [  0,      8,          12,          20,         24         ]

    dt = np.dtype({
            'names': names, 
            'formats': formats,
            'offsets': offsets 
        })
    return pd.DataFrame(np.fromfile(file, dt))

I would like to extend this method to work with gzipped files.

According to the numpy.fromfile documentation, the first parameter is file:

 file : file or str Open file object or filename 

As such, I added the following to check for a gzip file path:

import gzip

# Swap a ".gz" path for an opened gzip file object
if isinstance(file, str) and file.endswith(".gz"):
    file = gzip.open(file, "r")

However, when I try to pass this to fromfile, I get an IOError:

IOError: first argument must be an open file

Question:

How can I call numpy.fromfile with a gzipped file?

Edit:

As per request in comments, showing implementation which checks for gzipped files:

def read_best_file(file, **kwargs):
    '''
    Loads best price data into a dataframe
    '''
    names   = [ 'time', 'bid_size', 'bid_price', 'ask_size', 'ask_price' ]
    formats = [ 'u8',   'i4',       'f8',        'i4',       'f8'        ]
    offsets = [  0,      8,          12,          20,         24         ]

    dt = np.dtype({
            'names': names, 
            'formats': formats,
            'offsets': offsets 
        })

    if isinstance(file, str) and file.endswith(".gz"):
        file = gzip.open(file, "r")

    return pd.DataFrame(np.fromfile(file, dt))

gzip.open() doesn't return a true file object. It returns a duck-typed one: it walks like a duck and sounds like a duck, but isn't quite a duck as far as numpy is concerned. numpy is being strict here; since much of it is written in lower-level C code, it might require an actual file descriptor.

You can get the underlying file from the gzip.open() call, but that's just going to get you the compressed stream.

This is what I would do: I would use subprocess.Popen() to invoke zcat to uncompress the file as a stream.

>>> import subprocess
>>> p = subprocess.Popen(["/usr/bin/zcat", "foo.txt.gz"], stdout=subprocess.PIPE)
>>> type(p.stdout)
<type 'file'>
>>> p.stdout.read()
'hello world\n'

Now you can pass p.stdout as a file object to numpy:

np.fromfile(p.stdout, ...)
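If zcat is not available, or if a pipe turns out not to be accepted by fromfile on a given Python/numpy combination (I have not verified every combination), a variation on the same idea is to decompress to a temporary file first, so that fromfile sees an ordinary file on disk. The following is only a sketch using the standard library plus numpy; the helper name fromfile_gz and the temporary file name are my own, not part of numpy.

```python
import gzip
import os
import shutil
import tempfile

import numpy as np


def fromfile_gz(path, dtype):
    """Decompress a .gz file to a temporary file, then read it with np.fromfile.

    `fromfile_gz` is a hypothetical helper name, not part of numpy.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        raw_path = os.path.join(tmpdir, "decompressed.bin")
        # Stream the decompressed bytes to disk in chunks.
        with gzip.open(path, "rb") as src, open(raw_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        # np.fromfile now reads an ordinary, uncompressed file by name.
        return np.fromfile(raw_path, dtype=dtype)
```

This costs one extra pass over the data on disk, but the decompressed bytes are never held in memory twice.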

I have had success reading arrays of raw binary data from gzipped files by feeding the read() results through numpy.frombuffer(). This code works in Python 3.7.3, and perhaps in earlier versions also.

# Example: read short integers (signed) from gzipped raw binary file

import gzip
import numpy as np

fname_gzipped = 'my_binary_data.dat.gz'
raw_dtype = np.int16
with gzip.open(fname_gzipped, 'rb') as f:
    from_gzipped = np.frombuffer(f.read(), dtype=raw_dtype)

# Demonstrate equivalence with direct np.fromfile()
fname_raw = 'my_binary_data.dat'
from_raw = np.fromfile(fname_raw, dtype=raw_dtype)

# True
print('raw binary and gunzipped are the same: {}'.format(
    np.array_equiv(from_gzipped, from_raw)))

# False
wrong_dtype = np.uint8
binary_as_wrong_dtype = np.fromfile(fname_raw, dtype=wrong_dtype)
print('wrong dtype and gunzipped are the same: {}'.format(
    np.array_equiv(from_gzipped, binary_as_wrong_dtype)))
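Applied to the record layout from the question, the same frombuffer approach might look like the sketch below. The dtype is copied from the question; the helper name read_best_records and the constant BEST_DTYPE are my own, and the whole decompressed file is read into memory at once.

```python
import gzip

import numpy as np

# Record layout copied from the question: 32 bytes per record.
BEST_DTYPE = np.dtype({
    'names':   ['time', 'bid_size', 'bid_price', 'ask_size', 'ask_price'],
    'formats': ['u8',   'i4',       'f8',        'i4',       'f8'],
    'offsets': [0,      8,          12,          20,         24],
})


def read_best_records(path):
    """Return the records in `path` as a structured array (hypothetical helper)."""
    # Pick the opener from the file extension, as the question does.
    opener = gzip.open if path.endswith('.gz') else open
    with opener(path, 'rb') as f:
        data = f.read()
    # frombuffer returns a read-only view of `data`; copy it if you need to modify it.
    return np.frombuffer(data, dtype=BEST_DTYPE)
```

The result can then be handed to the DataFrame constructor as in the question, e.g. pd.DataFrame(read_best_records('best.dat.gz')), where the file name is only a placeholder.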
