
Is there a faster way to read or write a pandas DataFrame with about 1 million rows?

I am trying to be really specific about my issue. I have a DataFrame with 200+ columns and 1 million+ rows. I am reading it in and writing it out to an Excel file, which takes more than 45 minutes if I timed it right.

import pandas as pd

df = pd.read_csv("data_file.csv", low_memory=False, header=0, delimiter=',', na_values=('', 'nan'))
df.to_excel('data_file.xlsx', header=False, index=False)  # writing the xlsx is the slow step

My question: is there any way we can read or write to a file faster with a pandas DataFrame? This is just one example file; I have many more such files.

Two thoughts:

  • Investigate Dask, which provides a pandas-like DataFrame that can distribute processing of large datasets across multiple CPUs or clusters. It is hard to say to what degree you will get a speed-up if your performance is purely IO-bound, but it is certainly worth investigating. Take a quick look at the Dask use cases to get an understanding of its capabilities. (A minimal read sketch follows this list.)

  • If you are going to repeatedly read the same CSV input files, then I would suggest converting them to HDF, as reading HDF is orders of magnitude faster than reading the equivalent CSV file. It is as simple as reading the file into a DataFrame and then writing it back out using DataFrame.to_hdf(). Obviously this only helps if you can do the conversion as a one-off exercise, and then use the HDF files from that point forward whenever you run your code. (A conversion sketch follows the Dask one below.)
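For the Dask suggestion, a minimal sketch, assuming Dask is installed and using the same data_file.csv from the question:

import dask.dataframe as dd

# Dask reads the CSV lazily in partitions, parallelizing the parse across CPU cores
ddf = dd.read_csv("data_file.csv")
df = ddf.compute()  # materialize the result as an ordinary pandas DataFrame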
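And for the HDF suggestion, a one-off conversion sketch; the file name data_file.h5 and the key "data" are placeholders, and to_hdf() requires the PyTables package to be installed:

import pandas as pd

df = pd.read_csv("data_file.csv", low_memory=False)
df.to_hdf("data_file.h5", key="data", mode="w")  # one-off: write the HDF copy once

# ...then on every later run, load the much faster HDF copy instead of the CSV
df = pd.read_hdf("data_file.h5", "data")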

Regards, Ian

That is a big file you are working with. If you need to process the data, then you can't really get around the long read and write times.

Do NOT write to xlsx; writing xlsx is what takes so long. Write to CSV instead, as in the sketch below. It takes a minute on my cheap laptop with an SSD.
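A sketch of the CSV alternative; the output name data_file_out.csv is just an example:

# CSV is streamed as plain text, far cheaper than building an xlsx workbook in memory
df.to_csv("data_file_out.csv", index=False)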

