Faster way to create pandas dataframe with many rows

I am reading hdf5 files with large amounts of data. I want to store the data in a dataframe (it will contain around 1.3e9 rows). For the moment I am using the following procedure:

import numpy as np
import pandas as pd

df = pd.DataFrame()
for key in ['Column1', 'Column2', 'Column3']:
    df[key] = np.array(h5assembly.get(key))  # h5assembly: the opened HDF5 file/group

I have timed it and it takes ~110 seconds.

If I just assign the values to numpy arrays, like this:

v1 = np.array(h5assembly.get('Column1'))
v2 = np.array(h5assembly.get('Column2'))
v3 = np.array(h5assembly.get('Column3'))

it takes ~22 seconds.

Am I doing something wrong? Is it expected that creating the dataframe is so much slower? Is there any way to accelerate this process?

Yes, it is expected that a DataFrame will take longer than NumPy arrays. This is due to various reasons and I won't list them all. It is partly due to the way NumPy uses and frees up memory. NumPy operations are also implemented in C, a compiled language, which gives performance benefits.

An interesting comparison between pandas and NumPy performance may be seen here: https://penandpants.com/2014/09/05/performance-of-pandas-series-vs-numpy-arrays/

A package that aims to speed up pandas using parallelization is Modin: https://www.kdnuggets.com/2019/11/speed-up-pandas-4x.html

There is also a package called 'PyPolars', which aims to work in a very similar way to pandas, with greater performance because it is implemented in Rust: https://www.analyticsvidhya.com/blog/2021/02/is-pypolars-the-new-alternative-to-pandas/
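Before switching libraries, it may be worth trying to build the DataFrame in a single constructor call instead of adding columns one at a time to an empty frame. The following is only a sketch, not a measured fix: `frame_from_columns` and `fake_source` are hypothetical names, the random arrays stand in for the real HDF5 datasets, and whether `copy=False` actually avoids a copy depends on the pandas version, so time it on your own data.

```python
import time

import numpy as np
import pandas as pd


def frame_from_columns(source, keys):
    """Build a DataFrame in one constructor call.

    `source` is any mapping from key to array-like (a hypothetical stand-in
    for the question's `h5assembly`); np.asarray loads each column into memory.
    """
    data = {key: np.asarray(source[key]) for key in keys}
    # copy=False asks pandas not to copy the arrays again where it can avoid it.
    return pd.DataFrame(data, copy=False)


# Synthetic stand-in data (assumption: three float columns, far fewer rows
# than the 1.3e9 mentioned in the question).
keys = ["Column1", "Column2", "Column3"]
rng = np.random.default_rng(0)
fake_source = {key: rng.random(1_000_000) for key in keys}

start = time.perf_counter()
df = frame_from_columns(fake_source, keys)
elapsed = time.perf_counter() - start
print(f"built frame of shape {df.shape} in {elapsed:.3f} s")
```

Timing this next to the loop from the question, on the same data, should show whether the column-by-column assignment is where the extra time goes.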

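You can use pandas.read_hdf to read hdf files directly into a dataframe. Note that this is intended for files written by pandas itself (for example with DataFrame.to_hdf); an HDF5 file produced by other tools may not load this way.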

df = pd.read_hdf('./store.h5')
