
How to convert Numpy to Parquet without using Pandas?

The traditional way to save a numpy object to parquet is to use Pandas as an intermediary. However, I am working with a lot of data, which doesn't fit in Pandas without crashing my environment, because in Pandas the data takes up a lot of RAM.

I need to save to Parquet because I am working with variable-length arrays in numpy, and for that case parquet actually takes less disk space than .npy or .hdf5.

The following code is a minimal example: it downloads a small chunk of my data, converts between pandas objects and numpy objects to measure how much RAM they consume, and saves to npy and parquet files to see how much disk space they take.

# Download sample file, about 10 MB

from sys import getsizeof
import requests
import pickle
import numpy as np
import pandas as pd
import os

def download_file_from_google_drive(id, destination):
    URL = "https://docs.google.com/uc?export=download"

    session = requests.Session()

    response = session.get(URL, params = { 'id' : id }, stream = True)
    token = get_confirm_token(response)

    if token:
        params = { 'id' : id, 'confirm' : token }
        response = session.get(URL, params = params, stream = True)

    save_response_content(response, destination)    

def get_confirm_token(response):
    for key, value in response.cookies.items():
        if key.startswith('download_warning'):
            return value

    return None

def save_response_content(response, destination):
    CHUNK_SIZE = 32768

    with open(destination, "wb") as f:
        for chunk in response.iter_content(CHUNK_SIZE):
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)

download_file_from_google_drive('1-0R28Yhdrq2QWQ-4MXHIZUdZG2WZK2qR', 'sample.pkl')

sampleDF = pd.read_pickle('sample.pkl')

sampleDF.to_parquet( 'test1.pqt', compression = 'brotli', index = False )

# Parquet file takes up little space 
os.path.getsize('test1.pqt')

6594712

getsizeof(sampleDF)

22827172

sampleDF['totalCites2'] = sampleDF['totalCites2'].apply(lambda x: np.array(x))

#RAM reduced if the variable length batches are in numpy
getsizeof(sampleDF)

22401764

# Much less RAM as a numpy object (note: getsizeof only counts the array header
# and pointer buffer, not the row objects it references)
sampleNumpy = sampleDF.values
getsizeof(sampleNumpy)

112

# Much more space in .npy form 
np.save( 'test2.npy', sampleNumpy) 
os.path.getsize('test2.npy')

20825382

# Numpy savez. Not as good as parquet 
np.savez_compressed( 'test3.npy', sampleNumpy )
os.path.getsize('test3.npy.npz')

9873964

You can read/write numpy arrays to parquet directly using Apache Arrow (pyarrow), which is also the underlying backend for parquet in pandas. Note that parquet is a tabular format, so creating some table is still necessary.

import numpy as np
import pyarrow as pa

np_arr = np.array([1.3, 4.22, -5], dtype=np.float32)
pa_table = pa.table({"data": np_arr})
pa.parquet.write_table(pa_table, "test.parquet")
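Since the question is specifically about variable-length arrays, it may also help to see that pyarrow can store ragged rows as a list column and read them back into numpy without pandas. The following is a minimal sketch, not part of the original answer; the file name ragged.parquet and column name cites are illustrative only.

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Variable-length (ragged) rows: pyarrow infers a list<int64> column
ragged = [np.array([1, 2, 3]), np.array([4, 5]), np.array([6])]
table = pa.table({"cites": pa.array([row.tolist() for row in ragged])})
pq.write_table(table, "ragged.parquet", compression="brotli")

# Read the list column back as a list of numpy arrays, without pandas
loaded = pq.read_table("ragged.parquet")
rows = [np.asarray(x) for x in loaded.column("cites").to_pylist()]
print(rows)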

refs: numpy to pyarrow, pyarrow.parquet.write_table

Parquet format can be written using pyarrow; the correct import syntax is:

import pyarrow.parquet as pq, so you can use pq.write_table. Otherwise, using only import pyarrow as pa and then calling pa.parquet.write_table will return: AttributeError: module 'pyarrow' has no attribute 'parquet'.

Pyarrow requires the data to be organized column-wise, which means that in the case of numpy multidimensional arrays, you need to assign each dimension (column) to a specific field in the parquet table.

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq


ndarray = np.array(
    [
        [4.96266477e05, 4.55342071e06, -1.03240000e02, -3.70000000e01, 2.15592864e01],
        [4.96258372e05, 4.55344875e06, -1.03400000e02, -3.85000000e01, 2.40120775e01],
        [4.96249387e05, 4.55347732e06, -1.03330000e02, -3.47500000e01, 2.70718535e01],
    ]
)

ndarray_table = pa.table(
    {
        "X": ndarray[:, 0],
        "Y": ndarray[:, 1],
        "Z": ndarray[:, 2],
        "Amp": ndarray[:, 3],
        "Ang": ndarray[:, 4],
    }
)

pq.write_table(ndarray_table, "ndarray.parquet")
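To go the other way, the columns can be read back with pyarrow.parquet.read_table and stacked into a 2-D numpy array again. This is a minimal sketch assuming the file and column names from the example above:

import numpy as np
import pyarrow.parquet as pq

# Read the table back and reassemble the original 2-D array column by column
table = pq.read_table("ndarray.parquet")
restored = np.column_stack(
    [table.column(name).to_numpy() for name in ["X", "Y", "Z", "Amp", "Ang"]]
)
print(restored.shape)  # (3, 5)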
