
Pyarrow: read stream into pandas dataframe high memory consumption

I would like to first write a stream into an arrow file and then later read it back into a pandas dataframe, with as little memory overhead as possible.

Writing data in batches works perfectly fine:

import pyarrow as pa
import pandas as pd
import random

data = [pa.array([random.randint(0, 1000)]), pa.array(['B']), pa.array(['C'])]
columns = ['A','B','C']
batch = pa.RecordBatch.from_arrays(data, columns)

with pa.OSFile('test.arrow', 'wb') as f:
    with pa.RecordBatchStreamWriter(f, batch.schema) as writer:
        for i in range(1000 * 1000):
            data = [pa.array([random.randint(0, 1000)]), pa.array(['B']), pa.array(['C'])]
            batch = pa.RecordBatch.from_arrays(data, columns)
            writer.write_batch(batch)

Writing 1 million rows as above is fast and uses about 40MB of memory during the entire write. This is perfectly fine.

However, reading it back is not fine, since memory consumption goes up to 2GB before producing the final dataframe, which is about 118MB.

I tried this:

with pa.input_stream('test.arrow') as f:
    reader = pa.BufferReader(f.read())
    table = pa.ipc.open_stream(reader).read_all()
    df1 = table.to_pandas(split_blocks=True, self_destruct=True)

and this, with the same memory overhead:

with open('test.arrow', 'rb') as f:
   df1 = pa.ipc.open_stream(f).read_pandas()

Dataframe size:

print(df1.info(memory_usage='deep'))

Data columns (total 3 columns):
 #   Column  Non-Null Count    Dtype
---  ------  --------------    -----
 0   A       1000000 non-null  int64
 1   B       1000000 non-null  object
 2   C       1000000 non-null  object
dtypes: int64(1), object(2)
memory usage: 118.3 MB
None

What I would need is either a way to fix the memory usage with pyarrow, or a suggestion for another format that I could write data into incrementally and then read all of it into a pandas dataframe without too much memory overhead.

Your example is using many RecordBatches, each containing a single row. Such a RecordBatch has some overhead in addition to just the data (the schema, potential padding/alignment), and is thus not efficient for storing only a single row.

When reading the file with read_all() or read_pandas(), it first creates all those RecordBatches before converting them to a single Table. That overhead then adds up, and this is what you are seeing.
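
Conceptually, that read path amounts to something like the sketch below (an illustration rather than the actual implementation): every one-row RecordBatch is materialized, each with its own padded buffers, before they are all combined into one Table.

import pyarrow as pa

# Rough sketch of what read_all() boils down to for this file:
# all 1,000,000 one-row batches exist in memory at the same time,
# and only afterwards are they combined into a single Table.
with pa.OSFile('test.arrow', 'rb') as f:
    reader = pa.ipc.open_stream(f)
    batches = [batch for batch in reader]   # every tiny batch materialized
    table = pa.Table.from_batches(batches)  # combined only after all are allocated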

The recommended size for a RecordBatch is of course dependent on the exact use case, but a typical size is 64k to 1M rows.
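
As a sketch (not from the original answer), the write loop could buffer rows in plain Python lists and only emit a RecordBatch every batch_size rows; the batch_size value and the explicit schema below are illustrative choices.

import pyarrow as pa
import random

# Illustrative batch size; 64k to 1M rows per batch is the ballpark mentioned above.
batch_size = 64 * 1024
columns = ['A', 'B', 'C']
schema = pa.schema([('A', pa.int64()), ('B', pa.string()), ('C', pa.string())])

with pa.OSFile('test.arrow', 'wb') as f:
    with pa.RecordBatchStreamWriter(f, schema) as writer:
        a_buf, b_buf, c_buf = [], [], []
        for i in range(1000 * 1000):
            a_buf.append(random.randint(0, 1000))
            b_buf.append('B')
            c_buf.append('C')
            if len(a_buf) == batch_size:
                writer.write_batch(pa.RecordBatch.from_arrays(
                    [pa.array(a_buf), pa.array(b_buf), pa.array(c_buf)], columns))
                a_buf, b_buf, c_buf = [], [], []
        if a_buf:  # flush the remaining rows
            writer.write_batch(pa.RecordBatch.from_arrays(
                [pa.array(a_buf), pa.array(b_buf), pa.array(c_buf)], columns))

The trade-off is holding up to batch_size rows in Python lists while writing, in exchange for far fewer (and much larger) RecordBatches to reconstruct on the read side.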


To see the effect of the padding to 64 bytes per array (https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding), let's check the total allocated bytes versus the actual bytes represented by the RecordBatch:

import pyarrow as pa
 
batch = pa.RecordBatch.from_arrays(
    [pa.array([1]), pa.array(['B']), pa.array(['C'])],
    ['A','B','C']
)

# The size of the data stored in the RecordBatch
# 8 for the integer (int64), 9 for each string array (8 for the len-2 offset array (int32), 1 for the single string byte)
>>> batch.nbytes
26

# The size of the data actually being allocated by Arrow
# (5*64 for 5 buffers padded to 64 bytes)
>>> pa.total_allocated_bytes()
320

So you can see that just this padding already gives a big overhead for a tiny RecordBatch.
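
For comparison, here is the same check for a much larger batch (a sketch to run in a fresh session; the exact numbers depend on the pyarrow version and allocator), where the fixed per-buffer padding becomes negligible relative to the data:

import pyarrow as pa

n = 100_000
big_batch = pa.RecordBatch.from_arrays(
    [pa.array(range(n)), pa.array(['B'] * n), pa.array(['C'] * n)],
    ['A', 'B', 'C']
)

# Bytes logically represented by the batch vs. bytes allocated by Arrow's
# default memory pool; for a large batch the two should be very close.
print(big_batch.nbytes)
print(pa.total_allocated_bytes())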
