
Pandas read_pickle slowness

I have Python 3.4 with Pandas 0.17. I noticed that my program takes ~30 seconds to read a pickle file.

import pandas as pd

df = pd.read_csv(a, skiprows=[1])
df.to_pickle(b)
df2 = pd.read_pickle(b)  # this line takes almost 30 seconds

The original csv file is ~185 MB (2967000 lines) and the pickle file is 125 MB.

I have another pickle file (~95 MB) which is working fine (it can be read in <1 sec). Any suggestions?

I found a way to resolve the issue. My pickle file was being created by a cronjob running as the root user. The Python program was developed in a virtual environment, and the global environment does not have pandas. So when the root user ran the cronjob, it created the pickle file successfully, but something was wrong with that file. I modified the cronjob to use the Python binary from my virtualenv, and that fixed the issue. I can see the size difference between the pickle files created by the global Python and the virtualenv Python.
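A minimal sketch of the kind of crontab change this describes, assuming the virtualenv lives at /home/user/venv and the script is /home/user/job.py (both paths are hypothetical placeholders, not from the original post):

# Before: the cronjob runs the system Python, whose environment lacks the expected pandas
0 2 * * * /usr/bin/python3 /home/user/job.py

# After: point the cronjob at the virtualenv's interpreter instead
0 2 * * * /home/user/venv/bin/python /home/user/job.py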

I am still not sure how the root user was able to run the Python file while pandas was not available to it.
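One way to check which environment the cronjob actually runs in is to have the script log its interpreter and the pandas it imports; a minimal sketch (the print statements are just for illustration):

import sys
import pandas as pd

# Show which interpreter is executing the script and which pandas it picked up
print("interpreter:", sys.executable)
print("pandas:", pd.__version__, "from", pd.__file__)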
