Faster way to create a pandas DataFrame with many rows

Question

I am reading HDF5 files with large amounts of data. I want to store the data in a DataFrame (it will contain around 1.3e9 rows). For the moment I am using the following procedure:

df = pd.DataFrame()
for key in ['Column1', 'Column2', 'Column3']:
    df[key] = np.array(h5assembly.get(key))

I have timed it and it takes ~110 seconds.

If I just assign the values to NumPy arrays, like this:

v1 = np.array(h5assembly.get('Column1'))
v2 = np.array(h5assembly.get('Column2'))
v3 = np.array(h5assembly.get('Column3'))

It takes ~22 seconds.

Am I doing something wrong? Is it expected that creating the DataFrame is so much slower? Is there any way to accelerate this process?

Answer 1

Yes, it is expected that a DataFrame will take longer than NumPy arrays. This is due to various reasons and I won't list them all. It is partly due to the way NumPy uses and frees up memory. NumPy operations are also implemented in C, a compiled language, which gives performance benefits.
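One thing that may be worth trying, which is not taken from the answers here, is to pass all columns to the DataFrame constructor in a single call instead of growing an empty DataFrame one column at a time. The sketch below uses small synthetic arrays as a stand-in for the HDF5 datasets (the array size and random data are assumptions for illustration) and simply times both approaches. Whether the single call is actually faster will depend on the pandas version, the dtypes, and the available memory, so treat it as something to measure rather than a guaranteed fix.

```python
import time

import numpy as np
import pandas as pd

# Synthetic stand-ins for the HDF5 datasets (hypothetical data, small size).
n_rows = 1_000_000
rng = np.random.default_rng(0)
arrays = {key: rng.random(n_rows) for key in ["Column1", "Column2", "Column3"]}

# Approach from the question: start empty and add one column at a time.
start = time.perf_counter()
df_loop = pd.DataFrame()
for key, values in arrays.items():
    df_loop[key] = values
t_loop = time.perf_counter() - start

# Alternative: hand all columns to the constructor in a single call.
start = time.perf_counter()
df_dict = pd.DataFrame(arrays)
t_dict = time.perf_counter() - start

print(f"column by column:        {t_loop:.4f} s")
print(f"single constructor call: {t_dict:.4f} s")
```

Both approaches produce the same table, so the comparison is purely about construction time on your own data.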

An interesting comparison between pandas and NumPy performance may be seen here: https://penandpants.com/2014/09/05/performance-of-pandas-series-vs-numpy-arrays/

A package that aims to speed up pandas using parallelization is Modin: https://www.kdnuggets.com/2019/11/speed-up-pandas-4x.html

There is also a package called 'PyPolars' which aims to work in a very similar way to pandas, with greater performance because it is implemented in Rust: https://www.analyticsvidhya.com/blog/2021/02/is-pypolars-the-new-alternative-to-pandas/

Answer 2

You can use pandas.read_hdf to read HDF files directly into a DataFrame. Note that it relies on PyTables and expects a file laid out the way pandas writes it, so it may not work for arbitrary HDF5 datasets.

df = pd.read_hdf('./store.h5')
