I am using numpy.fromfile to construct an array which I can pass to the pandas.DataFrame constructor:
import numpy as np
import pandas as pd

def read_best_file(file, **kwargs):
    '''
    Loads best price data into a dataframe
    '''
    names   = [ 'time', 'bid_size', 'bid_price', 'ask_size', 'ask_price' ]
    formats = [ 'u8', 'i4', 'f8', 'i4', 'f8' ]
    offsets = [ 0, 8, 12, 20, 24 ]
    dt = np.dtype({
        'names': names,
        'formats': formats,
        'offsets': offsets
    })
    return pd.DataFrame(np.fromfile(file, dt))
I would like to extend this method to work with gzipped files.
According to the numpy.fromfile documentation, the first parameter is file:
file : file or str Open file object or filename
As such, I added the following to check for a gzip file path:
    if isinstance(file, str) and file.endswith(".gz"):
        file = gzip.open(file, "r")
However, when I try to pass this file object to numpy.fromfile, I get an IOError:
IOError: first argument must be an open file
Question:
How can I call numpy.fromfile with a gzipped file?
Edit:
As requested in the comments, here is the implementation that checks for gzipped files:
def read_best_file(file, **kwargs):
    '''
    Loads best price data into a dataframe
    '''
    names   = [ 'time', 'bid_size', 'bid_price', 'ask_size', 'ask_price' ]
    formats = [ 'u8', 'i4', 'f8', 'i4', 'f8' ]
    offsets = [ 0, 8, 12, 20, 24 ]
    dt = np.dtype({
        'names': names,
        'formats': formats,
        'offsets': offsets
    })
    if isinstance(file, str) and file.endswith(".gz"):
        file = gzip.open(file, "r")
    return pd.DataFrame(np.fromfile(file, dt))
gzip.open() doesn't return a true file object. It returns a duck-typed one: it walks like a duck and sounds like a duck, but it isn't quite a duck as far as numpy is concerned. numpy is being strict here (since much of it is written in lower-level C code, it might require an actual file descriptor).
You can get the underlying file from the gzip.open() call, but that's just going to get you the compressed stream.
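The difference between the two streams can be checked with a small sketch. It assumes the object returned by gzip.open() in binary mode exposes the underlying file as its fileobj attribute (true for GzipFile in current CPython, but treat it as an implementation detail); compare_streams is a hypothetical helper name.

```python
import gzip


def compare_streams(path):
    """Return (first two bytes of the file on disk, decompressed payload)."""
    with gzip.open(path, 'rb') as gz:
        # Reading through the gzip wrapper yields the decompressed bytes.
        payload = gz.read()
        # Assumption: gz.fileobj is the underlying (still compressed) file.
        gz.fileobj.seek(0)
        raw_magic = gz.fileobj.read(2)
    return raw_magic, payload
```

For a valid gzip file the first two raw bytes are the gzip magic number 0x1f 0x8b, which is what a reader working on the underlying file would see instead of the data.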
This is what I would do: I would use subprocess.Popen() to invoke zcat to uncompress the file as a stream.
>>> import subprocess
>>> p = subprocess.Popen(["/usr/bin/zcat", "foo.txt.gz"], stdout=subprocess.PIPE)
>>> type(p.stdout)
<type 'file'>
>>> p.stdout.read()
'hello world\n'
Now you can pass p.stdout as a file object to numpy:
np.fromfile(p.stdout, ...)
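The session above is from Python 2. On Python 3, as far as I know, np.fromfile wants a seekable file and may refuse a pipe, so a variant that collects the decompressed bytes and hands them to np.frombuffer is sketched here. It assumes a gzip executable is available on PATH (gzip -dc behaves like zcat); fromfile_via_gzip_cli is a hypothetical helper name.

```python
import subprocess

import numpy as np


def fromfile_via_gzip_cli(path, dtype):
    """Decompress `path` with an external gzip process and parse the bytes."""
    # Assumption: `gzip` is on PATH; -d decompresses, -c writes to stdout.
    result = subprocess.run(
        ['gzip', '-dc', path],
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError if gzip fails
    )
    # frombuffer interprets the in-memory bytes; the result is read-only.
    return np.frombuffer(result.stdout, dtype=dtype)
```

This keeps the decompression in a separate process, at the cost of holding the whole decompressed payload in memory before parsing.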
I have had success reading arrays of raw binary data from gzipped files by feeding the read() results through numpy.frombuffer(). This code works in Python 3.7.3, and perhaps in earlier versions also.
# Example: read short integers (signed) from gzipped raw binary file
import gzip
import numpy as np

fname_gzipped = 'my_binary_data.dat.gz'
raw_dtype = np.int16
with gzip.open(fname_gzipped, 'rb') as f:
    from_gzipped = np.frombuffer(f.read(), dtype=raw_dtype)

# Demonstrate equivalence with direct np.fromfile()
fname_raw = 'my_binary_data.dat'
from_raw = np.fromfile(fname_raw, dtype=raw_dtype)

# True
print('raw binary and gunzipped are the same: {}'.format(
    np.array_equiv(from_gzipped, from_raw)))

# False
wrong_dtype = np.uint8
binary_as_wrong_dtype = np.fromfile(fname_raw, dtype=wrong_dtype)
print('wrong dtype and gunzipped are the same: {}'.format(
    np.array_equiv(from_gzipped, binary_as_wrong_dtype)))
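Applied to the structured dtype from the question, the same idea might look like the sketch below. BEST_DTYPE and load_best_records are hypothetical names, and the function returns the record array rather than a DataFrame so that the sketch depends only on numpy; wrapping the result in pd.DataFrame(...) as in the question is left to the caller.

```python
import gzip

import numpy as np

# Record layout copied from the question: 32 bytes per record.
BEST_DTYPE = np.dtype({
    'names': ['time', 'bid_size', 'bid_price', 'ask_size', 'ask_price'],
    'formats': ['u8', 'i4', 'f8', 'i4', 'f8'],
    'offsets': [0, 8, 12, 20, 24],
})


def load_best_records(path):
    """Load best price records from a raw or gzipped binary file."""
    if isinstance(path, str) and path.endswith('.gz'):
        # Decompress fully in memory, then interpret the bytes as records.
        with gzip.open(path, 'rb') as f:
            return np.frombuffer(f.read(), dtype=BEST_DTYPE)
    # Uncompressed files can go straight through np.fromfile.
    return np.fromfile(path, dtype=BEST_DTYPE)
```

Note that np.frombuffer returns a read-only array backed by the bytes object; call .copy() on the result if the records need to be modified in place.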