How to convert Numpy to Parquet without using Pandas?
The traditional way to save a numpy object to Parquet is to go through Pandas as an intermediary. However, I am working with a lot of data, which doesn't fit in Pandas without crashing my environment, because in Pandas the data takes up a lot of RAM.
I need to save to Parquet because I am working with variable-length arrays in numpy, and for that Parquet actually takes less disk space than .npy or .hdf5.
The following code is a minimal example that downloads a small chunk of my data, converts between pandas objects and numpy objects to measure how much RAM they consume, and saves them to npy and parquet files to see how much disk space they take.
# Download sample file, about 10 MB
from sys import getsizeof
import requests
import pickle
import numpy as np
import pandas as pd
import os

def download_file_from_google_drive(id, destination):
    URL = "https://docs.google.com/uc?export=download"
    session = requests.Session()
    response = session.get(URL, params={'id': id}, stream=True)
    token = get_confirm_token(response)
    if token:
        params = {'id': id, 'confirm': token}
        response = session.get(URL, params=params, stream=True)
    save_response_content(response, destination)

def get_confirm_token(response):
    for key, value in response.cookies.items():
        if key.startswith('download_warning'):
            return value
    return None

def save_response_content(response, destination):
    CHUNK_SIZE = 32768
    with open(destination, "wb") as f:
        for chunk in response.iter_content(CHUNK_SIZE):
            if chunk:  # filter out keep-alive chunks
                f.write(chunk)

download_file_from_google_drive('1-0R28Yhdrq2QWQ-4MXHIZUdZG2WZK2qR', 'sample.pkl')
sampleDF = pd.read_pickle('sample.pkl')
sampleDF.to_parquet('test1.pqt', compression='brotli', index=False)

# Parquet file takes up little space
os.path.getsize('test1.pqt')
6594712

getsizeof(sampleDF)
22827172

sampleDF['totalCites2'] = sampleDF['totalCites2'].apply(lambda x: np.array(x))

# RAM reduced if the variable-length batches are in numpy
getsizeof(sampleDF)
22401764

# Much less RAM as a numpy object
sampleNumpy = sampleDF.values
getsizeof(sampleNumpy)
112

# Much more space in .npy form
np.save('test2.npy', sampleNumpy)
os.path.getsize('test2.npy')
20825382

# Numpy savez_compressed: not as good as parquet
np.savez_compressed('test3.npy', sampleNumpy)
os.path.getsize('test3.npy.npz')
9873964
You can read/write numpy arrays to Parquet directly using Apache Arrow (pyarrow), which is also the underlying Parquet backend in pandas. Note that Parquet is a tabular format, so creating some table is still necessary.
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

np_arr = np.array([1.3, 4.22, -5], dtype=np.float32)
pa_table = pa.table({"data": np_arr})
pq.write_table(pa_table, "test.parquet")
refs: numpy to pyarrow, pyarrow.parquet.write_table
Parquet format can be written using pyarrow; the correct import syntax is:
import pyarrow.parquet as pq
so you can use pq.write_table. Otherwise, using only import pyarrow as pa and then calling pa.parquet.write_table will return: AttributeError: module 'pyarrow' has no attribute 'parquet'.
Pyarrow requires the data to be organized column-wise, which means that in the case of numpy multidimensional arrays, you need to assign each dimension to a specific field in the parquet column.
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
ndarray = np.array(
    [
        [4.96266477e05, 4.55342071e06, -1.03240000e02, -3.70000000e01, 2.15592864e01],
        [4.96258372e05, 4.55344875e06, -1.03400000e02, -3.85000000e01, 2.40120775e01],
        [4.96249387e05, 4.55347732e06, -1.03330000e02, -3.47500000e01, 2.70718535e01],
    ]
)

ndarray_table = pa.table(
    {
        "X": ndarray[:, 0],
        "Y": ndarray[:, 1],
        "Z": ndarray[:, 2],
        "Amp": ndarray[:, 3],
        "Ang": ndarray[:, 4],
    }
)

pq.write_table(ndarray_table, "ndarray.parquet")