How to convert Numpy to Parquet without using Pandas?
The traditional way to save a numpy object to Parquet is to go through Pandas as an intermediary. However, I am working with a lot of data, which doesn't fit in Pandas without crashing my environment, because in Pandas the data takes up a lot of RAM.
I need to save to Parquet because I am working with variable-length arrays in numpy, and for that Parquet actually takes less disk space than .npy or .hdf5.
The following code is a minimal example that downloads a small chunk of my data, converts between pandas objects and numpy objects to measure how much RAM they consume, and saves them to npy and parquet files to see how much disk space they take.
# Download sample file, about 10 MB
from sys import getsizeof
import requests
import pickle
import numpy as np
import pandas as pd
import os

def download_file_from_google_drive(id, destination):
    URL = "https://docs.google.com/uc?export=download"
    session = requests.Session()
    response = session.get(URL, params={'id': id}, stream=True)
    token = get_confirm_token(response)
    if token:
        params = {'id': id, 'confirm': token}
        response = session.get(URL, params=params, stream=True)
    save_response_content(response, destination)

def get_confirm_token(response):
    for key, value in response.cookies.items():
        if key.startswith('download_warning'):
            return value
    return None

def save_response_content(response, destination):
    CHUNK_SIZE = 32768
    with open(destination, "wb") as f:
        for chunk in response.iter_content(CHUNK_SIZE):
            if chunk:  # filter out keep-alive chunks
                f.write(chunk)

download_file_from_google_drive('1-0R28Yhdrq2QWQ-4MXHIZUdZG2WZK2qR', 'sample.pkl')
sampleDF = pd.read_pickle('sample.pkl')
sampleDF.to_parquet('test1.pqt', compression='brotli', index=False)

# Parquet file takes up little space
os.path.getsize('test1.pqt')
6594712

getsizeof(sampleDF)
22827172

sampleDF['totalCites2'] = sampleDF['totalCites2'].apply(lambda x: np.array(x))

# RAM reduced if the variable-length batches are in numpy
getsizeof(sampleDF)
22401764

# Much less RAM as a numpy object
sampleNumpy = sampleDF.values
getsizeof(sampleNumpy)
112

# Much more space in .npy form
np.save('test2.npy', sampleNumpy)
os.path.getsize('test2.npy')
20825382

# Numpy savez_compressed: not as good as parquet
np.savez_compressed('test3.npy', sampleNumpy)
os.path.getsize('test3.npy.npz')
9873964
You can read/write numpy arrays to Parquet directly using Apache Arrow (pyarrow), which is also the underlying Parquet backend in pandas. Note that Parquet is a tabular format, so creating some table is still necessary.
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

np_arr = np.array([1.3, 4.22, -5], dtype=np.float32)
pa_table = pa.table({"data": np_arr})
pq.write_table(pa_table, "test.parquet")
refs: numpy to pyarrow, pyarrow.parquet.write_table
Parquet format can be written using pyarrow; the correct import syntax is:
import pyarrow.parquet as pq
so you can use pq.write_table. Otherwise, using only import pyarrow as pa and then calling pa.parquet.write_table will return: AttributeError: module 'pyarrow' has no attribute 'parquet'.
Pyarrow requires the data to be organized column-wise, which means that in the case of numpy multidimensional arrays, you need to assign each dimension to a specific field in the parquet column.
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
ndarray = np.array(
    [
        [4.96266477e05, 4.55342071e06, -1.03240000e02, -3.70000000e01, 2.15592864e01],
        [4.96258372e05, 4.55344875e06, -1.03400000e02, -3.85000000e01, 2.40120775e01],
        [4.96249387e05, 4.55347732e06, -1.03330000e02, -3.47500000e01, 2.70718535e01],
    ]
)

ndarray_table = pa.table(
    {
        "X": ndarray[:, 0],
        "Y": ndarray[:, 1],
        "Z": ndarray[:, 2],
        "Amp": ndarray[:, 3],
        "Ang": ndarray[:, 4],
    }
)

pq.write_table(ndarray_table, "ndarray.parquet")