
Pyarrow: read stream into pandas dataframe high memory consumption

I would like to first write a stream into an arrow file and then later read it back into a pandas dataframe, with as little memory overhead as possible.

Writing data in batches works perfectly fine:

import pyarrow as pa
import pandas as pd
import random

data = [pa.array([random.randint(0, 1000)]), pa.array(['B']), pa.array(['C'])]
columns = ['A','B','C']
batch = pa.RecordBatch.from_arrays(data, columns)

with pa.OSFile('test.arrow', 'wb') as f:
    with pa.RecordBatchStreamWriter(f, batch.schema) as writer:
        for i in range(1000 * 1000):
            data = [pa.array([random.randint(0, 1000)]), pa.array(['B']), pa.array(['C'])]
            batch = pa.RecordBatch.from_arrays(data, columns)
            writer.write_batch(batch)

Writing 1 million rows as above is fast and uses about 40MB of memory during the entire write. This is perfectly fine.

However, reading it back is not fine, since memory consumption goes up to 2GB before producing the final dataframe, which is about 118MB.

I tried this:

with pa.input_stream('test.arrow') as f:
    reader = pa.BufferReader(f.read())
    table = pa.ipc.open_stream(reader).read_all()
    df1 = table.to_pandas(split_blocks=True, self_destruct=True)

and this, with the same memory overhead:

with open('test.arrow', 'rb') as f:
   df1 = pa.ipc.open_stream(f).read_pandas()

Dataframe size:

print(df1.info(memory_usage='deep'))

Data columns (total 3 columns):
 #   Column  Non-Null Count    Dtype
---  ------  --------------    -----
 0   A       1000000 non-null  int64
 1   B       1000000 non-null  object
 2   C       1000000 non-null  object
dtypes: int64(1), object(2)
memory usage: 118.3 MB
None

What I would need is either a way to fix the memory usage with pyarrow, or a suggestion for another format that I could write data into incrementally and then read all of it into a pandas dataframe without too much memory overhead.

Your example is using many RecordBatches, each containing a single row. Such a RecordBatch has some overhead in addition to just the data (the schema, potential padding/alignment), and is thus not efficient for storing only a single row.

When reading the file with read_all() or read_pandas(), it first creates all those RecordBatches before converting them to a single Table. That overhead then adds up, and this is what you are seeing.
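
Conceptually, that read path amounts to something like the sketch below (an illustration rather than the actual implementation): every one-row RecordBatch is materialized, each with its own padded buffers, before they are all combined into one Table.

import pyarrow as pa

# Rough sketch of what read_all() boils down to for this file:
# all 1,000,000 one-row batches exist in memory at the same time,
# and only afterwards are they combined into a single Table.
with pa.OSFile('test.arrow', 'rb') as f:
    reader = pa.ipc.open_stream(f)
    batches = [batch for batch in reader]   # every tiny batch materialized
    table = pa.Table.from_batches(batches)  # combined only after all are allocated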

The recommended size for a RecordBatch is of course dependent on the exact use case, but a typical size is 64k to 1M rows.
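
As a sketch (not from the original answer), the write loop could buffer rows in plain Python lists and only emit a RecordBatch every batch_size rows; the batch_size value and the explicit schema below are illustrative choices.

import pyarrow as pa
import random

# Illustrative batch size; 64k to 1M rows per batch is the ballpark mentioned above.
batch_size = 64 * 1024
columns = ['A', 'B', 'C']
schema = pa.schema([('A', pa.int64()), ('B', pa.string()), ('C', pa.string())])

with pa.OSFile('test.arrow', 'wb') as f:
    with pa.RecordBatchStreamWriter(f, schema) as writer:
        a_buf, b_buf, c_buf = [], [], []
        for i in range(1000 * 1000):
            a_buf.append(random.randint(0, 1000))
            b_buf.append('B')
            c_buf.append('C')
            if len(a_buf) == batch_size:
                writer.write_batch(pa.RecordBatch.from_arrays(
                    [pa.array(a_buf), pa.array(b_buf), pa.array(c_buf)], columns))
                a_buf, b_buf, c_buf = [], [], []
        if a_buf:  # flush the remaining rows
            writer.write_batch(pa.RecordBatch.from_arrays(
                [pa.array(a_buf), pa.array(b_buf), pa.array(c_buf)], columns))

The trade-off is holding up to batch_size rows in Python lists while writing, in exchange for far fewer (and much larger) RecordBatches to reconstruct on the read side.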


To see the effect of the padding to 64 bytes per array (https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding), let's check the total allocated bytes versus the actual bytes represented by the RecordBatch:

import pyarrow as pa
 
batch = pa.RecordBatch.from_arrays(
    [pa.array([1]), pa.array(['B']), pa.array(['C'])],
    ['A','B','C']
)

# The size of the data stored in the RecordBatch
# 8 for the integer (int64), 9 for each string array (8 for the len-2 offset array (int32), 1 for the single string byte)
>>> batch.nbytes
26

# The size of the data actually being allocated by Arrow
# (5*64 for 5 buffers padded to 64 bytes)
>>> pa.total_allocated_bytes()
320

So you can see that just this padding already gives a big overhead for a tiny RecordBatch.
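
For comparison, here is the same check for a much larger batch (a sketch to run in a fresh session; the exact numbers depend on the pyarrow version and allocator), where the fixed per-buffer padding becomes negligible relative to the data:

import pyarrow as pa

n = 100_000
big_batch = pa.RecordBatch.from_arrays(
    [pa.array(range(n)), pa.array(['B'] * n), pa.array(['C'] * n)],
    ['A', 'B', 'C']
)

# Bytes logically represented by the batch vs. bytes allocated by Arrow's
# default memory pool; for a large batch the two should be very close.
print(big_batch.nbytes)
print(pa.total_allocated_bytes())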
