
Is there a faster way to read or write a pandas DataFrame with about 1 million rows?

I am trying to be really specific about my issue. I have a DataFrame with 200+ columns and 1 million+ rows. I am reading it in and writing it out to an Excel file, which takes more than 45 minutes if I timed it right.

import pandas as pd

df = pd.read_csv("data_file.csv", low_memory=False, header=0, delimiter=',', na_values=('', 'nan'))
df.to_excel('data_file.xlsx', header=False, index=False)  # writing the xlsx is the slow step

My question: is there any way we can read or write to a file faster with a pandas DataFrame? This is just one example file; I have many more such files.

Two thoughts:

  • Investigate Dask, which provides a pandas-like DataFrame that can distribute processing of large datasets across multiple CPUs or clusters. It is hard to say to what degree you will get a speed-up if your performance is purely IO-bound, but it is certainly worth investigating. Take a quick look at the Dask use cases to get an understanding of its capabilities. (A minimal read sketch follows this list.)

  • If you are going to repeatedly read the same CSV input files, then I would suggest converting them to HDF, as reading HDF is orders of magnitude faster than reading the equivalent CSV file. It is as simple as reading the file into a DataFrame and then writing it back out using DataFrame.to_hdf(). Obviously this only helps if you can do the conversion as a one-off exercise, and then use the HDF files from that point forward whenever you run your code. (A conversion sketch follows the Dask one below.)
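For the Dask suggestion, a minimal sketch, assuming Dask is installed and using the same data_file.csv from the question:

import dask.dataframe as dd

# Dask reads the CSV lazily in partitions, parallelizing the parse across CPU cores
ddf = dd.read_csv("data_file.csv")
df = ddf.compute()  # materialize the result as an ordinary pandas DataFrame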
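And for the HDF suggestion, a one-off conversion sketch; the file name data_file.h5 and the key "data" are placeholders, and to_hdf() requires the PyTables package to be installed:

import pandas as pd

df = pd.read_csv("data_file.csv", low_memory=False)
df.to_hdf("data_file.h5", key="data", mode="w")  # one-off: write the HDF copy once

# ...then on every later run, load the much faster HDF copy instead of the CSV
df = pd.read_hdf("data_file.h5", "data")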

Regards, Ian

That is a big file you are working with. If you need to process the data, then you can't really get around the long read and write times.

Do NOT write to xlsx; writing xlsx is what takes so long. Write to CSV instead, as in the sketch below. It takes a minute on my cheap laptop with an SSD.
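A sketch of the CSV alternative; the output name data_file_out.csv is just an example:

# CSV is streamed as plain text, far cheaper than building an xlsx workbook in memory
df.to_csv("data_file_out.csv", index=False)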

