How can I open an Excel file created from pandas faster?

The Excel file I create from Python is extremely slow to open, even though the file is only about 50 MB.

I have tried both pandas and openpyxl.

def to_file(list_report,list_sheet,strip_columns,Name):
    i = 0
    wb = ExcelWriter(path_output + '\\' + Name + dateformat + '.xlsx')
    while i <= len(list_report)-1:
        try:
            df = pd.DataFrame(pd.read_csv(path_input + '\\' + list_report[i] + reportdate + '.csv'))
            for column in strip_column:
                try:
                    df[column] = df[column].str.strip('=("")')
                except:
                    pass
            df = adjust_report(df,list_report[i])
            df = df.apply(pd.to_numeric, errors ='ignore', downcast = 'integer')
            df.to_excel(wb, sheet_name = list_sheet[i], index = False)
        except:
            print('Missing report: ' + list_report[i])
        i += 1
    wb.save()

Is there any way to speed it up?

idiom

Let us rename list_report to reports. Then your while loop is usually expressed as simply: for i in range(len(reports)):

You access the i-th element several times. The loop could bind that for you, with: for i, report in enumerate(reports):

But it turns out you never even need i. So most folks would write this as: for report in reports:
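
One caveat: the loop in the question also uses i to look up list_sheet[i], so dropping the index means pairing the two lists some other way. A minimal sketch using zip, where the report and sheet names are made-up stand-ins for the real lists and pair_reports is a hypothetical helper:

```python
# Hypothetical stand-ins for list_report and list_sheet from the question.
reports = ["sales", "returns", "inventory"]
sheets = ["Sales", "Returns", "Inventory"]


def pair_reports(reports, sheets):
    """Return (report, sheet) pairs without tracking an index by hand."""
    # zip stops at the shorter list, so both lists should have equal length.
    return list(zip(reports, sheets))


for report, sheet in pair_reports(reports, sheets):
    # In the real function this is where the CSV would be read and written.
    print(f"{report} -> sheet {sheet}")
```

With this shape, the loop body never needs list_report[i] or list_sheet[i].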

code organization

This bit of code is very nice:

        for column in strip_column:
            try:
                df[column] = df[column].str.strip('=("")')
            except:
                pass

I recommend you bury it in a helper function, using def strip_punctuation. (The list should be plural, I think? strip_columns?) Then you would have a simple sequence of df assignments.
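
Such a helper might look like the sketch below. The signature and the choice to catch only KeyError and AttributeError (rather than a bare except) are my assumptions, not something the question specifies, and the sample column values are invented:

```python
import pandas as pd


def strip_punctuation(df, strip_columns):
    """Strip the characters = ( " ) from both ends of the named text columns."""
    for column in strip_columns:
        try:
            df[column] = df[column].str.strip('=("")')
        except (KeyError, AttributeError):
            # KeyError: the column is missing from this report.
            # AttributeError: the column is not text, so .str is unavailable.
            continue
    return df


# Invented sample data: one text column with wrapped values, one numeric column.
df = pd.DataFrame({"account": ['="001"', '="002"'], "amount": [10, 20]})
df = strip_punctuation(df, ["account", "amount", "not_present"])
print(df)
```

The body of the main loop then reads as a short sequence of assignments: read the CSV, strip_punctuation, adjust_report, convert to numeric, write the sheet.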

timing

Profile the elapsed time with time(). Surround each df assignment with code like this:

    from time import time

    t0 = time()
    df = ...
    print(time() - t0)

That will show you which part of your processing pipeline takes the longest and therefore deserves the most effort to speed up.
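
If repeating the t0 lines around every step gets tedious, the same measurement can be wrapped in a small context manager. This is only a sketch: timed and timings are names I made up, it uses perf_counter() rather than time() because that clock is intended for measuring intervals, and the work inside each block is a placeholder for the real pandas calls:

```python
from contextlib import contextmanager
from time import perf_counter

# Maps a step label to the seconds it took (hypothetical bookkeeping dict).
timings = {}


@contextmanager
def timed(label):
    """Record the wall-clock seconds spent inside the with-block under label."""
    start = perf_counter()
    try:
        yield
    finally:
        timings[label] = perf_counter() - start


with timed("read_csv"):
    total = sum(range(100_000))  # placeholder for df = pd.read_csv(...)

with timed("adjust_report"):
    squares = [n * n for n in range(100_000)]  # placeholder for adjust_report(...)

# Print the slowest step first.
for label, seconds in sorted(timings.items(), key=lambda item: item[1], reverse=True):
    print(f"{label}: {seconds:.4f}s")
```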

I suspect adjust_report() uses the bulk of the time, but without seeing it that's hard to say.
