Pandas df.to_excel is way too slow, is there any way to speed it up?
I'm working with a dataset that has almost 60 columns (text, addresses, numbers). After processing the data with pandas, I have to export it to xlsx format.

This is how I'm generating the output:

    with pd.ExcelWriter("output.xlsx", engine='xlsxwriter') as writer:
        df.to_excel(writer, sheet_name="sheet", index=False)

I have also tried this method:

    df.to_excel('output.xlsx', index=False, engine='xlsxwriter')

What I have noticed is that generating xlsx is remarkably slower than a format like csv. And as the number of records grows, the time to generate the xlsx file increases significantly.
Is this the normal and expected behavior of .to_excel, or is something wrong here? Is there any way to debug and solve this problem?
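One way to check whether the slowdown is specific to xlsx is to time both exports on the same data at a few sizes. The following is only a rough sketch: the helper name `time_exports`, the random 60-column frame, and the temporary file names are assumptions made for illustration, not part of the original question.

```python
import tempfile
import time
from pathlib import Path

import numpy as np
import pandas as pd


def time_exports(n_rows: int, n_cols: int = 60) -> dict:
    """Return the seconds spent writing one random frame as csv and as xlsx."""
    rng = np.random.default_rng(0)
    df = pd.DataFrame(
        rng.random((n_rows, n_cols)),
        columns=[f"col_{i}" for i in range(n_cols)],
    )
    timings = {}
    with tempfile.TemporaryDirectory() as tmp:
        # Time the csv export.
        start = time.perf_counter()
        df.to_csv(Path(tmp) / "output.csv", index=False)
        timings["csv"] = time.perf_counter() - start

        # Time the xlsx export with the same engine as in the question.
        start = time.perf_counter()
        df.to_excel(Path(tmp) / "output.xlsx", index=False, engine="xlsxwriter")
        timings["xlsx"] = time.perf_counter() - start
    return timings


if __name__ == "__main__":
    for n in (100, 1_000):
        print(n, time_exports(n))
```

If the xlsx time grows roughly in proportion to the row count, the cost is most likely in the writer itself rather than in anything unusual about the data.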
I need to be able to generate xlsx files for ~300K to ~600K records in a matter of seconds, but it currently takes around 6 minutes to generate an Excel file for about 500K records.

The hardware I'm using to generate these files has 16 CPU cores and 64 GB of memory.
I found a slightly faster solution than just using engine='xlsxwriter':

    import pandas as pd
    from xlsxwriter.workbook import Workbook

    def export_excel(df: pd.DataFrame, file_path_out: str):
        workbook = Workbook(file_path_out)
        worksheet = workbook.add_worksheet()
        # Header row
        worksheet.write_row(0, 0, list(df.columns))
        # enumerate() gives the sheet row even if the DataFrame index is not 0..n-1
        for row_number, (_, row) in enumerate(df.iterrows(), start=1):
            worksheet.write_row(row_number, 0, list(row))
        workbook.close()

The following table shows the runtime of three methods for 100 to 50k rows of random data (with 60 columns). The measured time is in seconds. openpyxl and xlsxwriter are df.to_excel(engine=...) calls, and export_excel is the code I proposed above. It's not a game changer, but it is faster.

    row count    openpyxl    xlsxwriter    export_excel
          100       0.127         0.157           0.126
         1000       1.192         0.846           0.776
        10000      12.290         8.142           6.129
        25000      32.343        23.325          18.124
        50000      63.357        40.772          30.407

Using workbook = Workbook(file_path_out, {'constant_memory': True}) improves the runtime a little further, but not by much: about 1 s faster for 25k rows.
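For reference, here is a minimal sketch of how the constant_memory option could be combined with a row-by-row export. The function name `export_excel_constant_memory` is hypothetical, and the use of itertuples instead of iterrows is my own substitution, not something that was benchmarked above.

```python
import pandas as pd
from xlsxwriter.workbook import Workbook


def export_excel_constant_memory(df: pd.DataFrame, file_path_out: str) -> None:
    # In constant_memory mode each row is flushed once a later row is started,
    # so rows have to be written in ascending order, as this loop does.
    workbook = Workbook(file_path_out, {"constant_memory": True})
    worksheet = workbook.add_worksheet()

    # Header row.
    worksheet.write_row(0, 0, list(df.columns))

    # itertuples(index=False, name=None) yields plain tuples of cell values.
    rows = df.itertuples(index=False, name=None)
    for row_number, values in enumerate(rows, start=1):
        worksheet.write_row(row_number, 0, values)

    workbook.close()
```

Note that this sketch assumes the frame contains no NaN or infinite values, since xlsxwriter rejects those by default.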
According to the pandas.DataFrame.to_excel docs, the engine value may be either openpyxl or xlsxwriter. Since you are using the latter, I suggest testing engine='openpyxl' against engine='xlsxwriter'.
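Such an engine comparison could look roughly like the sketch below. The helper name `compare_engines` and the temporary file names are assumptions for illustration, and an engine whose package is not installed is simply skipped.

```python
import importlib.util
import tempfile
import time
from pathlib import Path

import pandas as pd


def compare_engines(df: pd.DataFrame, engines=("openpyxl", "xlsxwriter")) -> dict:
    """Return the seconds df.to_excel takes with each available engine."""
    timings = {}
    with tempfile.TemporaryDirectory() as tmp:
        for engine in engines:
            # Skip engines whose package is not installed.
            if importlib.util.find_spec(engine) is None:
                continue
            start = time.perf_counter()
            df.to_excel(Path(tmp) / f"{engine}.xlsx", index=False, engine=engine)
            timings[engine] = time.perf_counter() - start
    return timings
```

Calling `compare_engines(df)` on a representative slice of the real data, rather than on the full 500K records, should be enough to see which engine is faster.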