Serializing a Pandas DataFrame to an in-memory buffer representation

Question

What is the fastest way to serialize a DataFrame to an in-memory representation? Based on some research, the Apache Feather format seems to be widely acknowledged as the fastest available format by most metrics.

My goal is to get the serialized bytes of a DataFrame. The only issue with Feather is that I would like to avoid the overhead of writing to and loading from disk, and the Feather API seems to only allow file I/O. Is there a different format I should be looking into for this, or is there perhaps a way in Python to "fake" a file, forcing Feather to write to an in-memory buffer instead?

Answer 1

pyarrow provides BufferOutputStream for writing into memory instead of files. Contrary to what their docstrings suggest, read_feather and write_feather also support reading from memory and writing into a writer interface, not just file paths.

With the following code, you can serialise a DataFrame into memory without touching the filesystem and then reconstruct it directly from that buffer.

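import pandas as pd
import pyarrow as pa
from pyarrow.feather import read_feather, write_feather

df = pd.DataFrame({"column": [1, 2]})
# Write the Feather payload into an in-memory sink instead of a file.
output_stream = pa.BufferOutputStream()
write_feather(df, output_stream)
# getvalue() returns a pyarrow.Buffer, which read_feather accepts directly.
df_reconstructed = read_feather(output_stream.getvalue())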

Answer 1: score 3, accepted, 2020-05-27 12:41:24
