Pandas df.to_excel is way too slow; is there any way to speed it up?

I'm working with a set of data which has almost 60 columns (Text/Address/Numbers). After processing the data using Pandas, I have to export it to xlsx format.

This is how I'm generating the output:

with pd.ExcelWriter("output.xlsx", engine='xlsxwriter') as writer:
    df.to_excel(writer, sheet_name="sheet", index=False)

I have also tried this method:

df.to_excel('output.xlsx', index=False, engine='xlsxwriter')

What I have noticed is that generating xlsx is remarkably slower than a format like csv. As the number of records grows, the time to generate the xlsx file increases significantly.

Is this the normal and expected behavior of .to_excel, or is there something wrong here? Is there any way to debug and solve this problem?

I have to be able to generate xlsx files for ~300K to ~600K records in a matter of seconds, but it takes me around 6 minutes to generate an Excel file for about 500K records.

The hardware that I'm using to generate these files has 16 CPU cores and 64 GB of memory.

I found a slightly faster solution than just using engine='xlsxwriter':

import pandas as pd
from xlsxwriter.workbook import Workbook

def export_excel(df: pd.DataFrame, file_path_out: str):
    workbook = Workbook(file_path_out)
    worksheet = workbook.add_worksheet()

    # Header row
    worksheet.write_row(0, 0, list(df.columns))

    # Count rows with enumerate rather than the DataFrame index, which may
    # not run from 0 to n-1 (e.g. after filtering)
    for row_num, (_, row) in enumerate(df.iterrows(), start=1):
        worksheet.write_row(row_num, 0, list(row))
    workbook.close()

The following table shows the runtime of three methods for 100 to 50k rows of random data (with 60 columns). The measured time is in seconds. openpyxl and xlsxwriter are df.to_excel(engine=...) calls, and export_excel is my proposed code above. It's not a game changer, but it's faster.

row count   openpyxl   xlsxwriter   export_excel
100            0.127        0.157          0.126
1000           1.192        0.846          0.776
10000         12.290        8.142          6.129
25000         32.343       23.325         18.124
50000         63.357       40.772         30.407

Using workbook = Workbook(file_path_out, {'constant_memory': True}) improves the runtimes a bit more (but not by much: about 1 s faster for 25k rows).
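
A minimal sketch of that constant_memory variant, assuming xlsxwriter is installed. The function name export_excel_constant_memory is hypothetical, and iterating with itertuples instead of iterrows is an assumption of this sketch, not something the timings above cover. In constant_memory mode xlsxwriter flushes each row once the next one starts, so rows have to be written in increasing order, which this loop does.

```python
import pandas as pd
from xlsxwriter.workbook import Workbook


def export_excel_constant_memory(df: pd.DataFrame, file_path_out: str) -> None:
    """Write df to an xlsx file row by row (hypothetical helper name)."""
    # constant_memory flushes each finished row to disk, so rows must be
    # written in increasing row order.
    workbook = Workbook(file_path_out, {"constant_memory": True})
    worksheet = workbook.add_worksheet()

    # Header row.
    worksheet.write_row(0, 0, list(df.columns))

    # itertuples(index=False, name=None) yields plain tuples of cell values;
    # enumerate supplies the sheet row number, independent of the df index.
    for row_num, row in enumerate(df.itertuples(index=False, name=None), start=1):
        worksheet.write_row(row_num, 0, row)

    workbook.close()
```

It would be called the same way as export_excel, for example export_excel_constant_memory(df, "output.xlsx").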

According to the pandas.DataFrame.to_excel docs, the engine value can be either openpyxl or xlsxwriter. Since you are using the latter, I suggest testing engine='openpyxl' against engine='xlsxwriter'.
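
One way such a comparison could be run is sketched below. The helper name time_engines and the data size are assumptions chosen for illustration; engines that are not installed are skipped, and the numbers will vary by machine and data.

```python
import importlib.util
import os
import tempfile
import time

import numpy as np
import pandas as pd


def time_engines(df: pd.DataFrame, engines=("openpyxl", "xlsxwriter")) -> dict:
    """Return {engine: seconds} for df.to_excel with each installed engine."""
    timings = {}
    with tempfile.TemporaryDirectory() as tmp_dir:
        for engine in engines:
            # Skip engines that are not installed instead of failing.
            if importlib.util.find_spec(engine) is None:
                continue
            path = os.path.join(tmp_dir, f"{engine}.xlsx")
            start = time.perf_counter()
            df.to_excel(path, index=False, engine=engine)
            timings[engine] = time.perf_counter() - start
    return timings


if __name__ == "__main__":
    # Random numeric data: 1,000 rows x 60 columns (sizes picked for illustration).
    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.random((1_000, 60)))
    for engine, seconds in time_engines(df).items():
        print(f"{engine}: {seconds:.3f} s")
```

A single run per engine is noisy, so repeating the measurement a few times on data shaped like the real export gives a fairer picture.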
