How can I open an Excel file created with pandas faster?

Question

The Excel file created from Python is extremely slow to open, even though the file is only about 50 MB.

I have tried both pandas and openpyxl.

def to_file(list_report,list_sheet,strip_columns,Name):
    i = 0
    wb = ExcelWriter(path_output + '\\' + Name + dateformat + '.xlsx')
    while i <= len(list_report)-1:
        try:
            df = pd.DataFrame(pd.read_csv(path_input + '\\' + list_report[i] + reportdate + '.csv'))
            for column in strip_column:
                try:
                    df[column] = df[column].str.strip('=("")')
                except:
                    pass
            df = adjust_report(df,list_report[i])
            df = df.apply(pd.to_numeric, errors ='ignore', downcast = 'integer')
            df.to_excel(wb, sheet_name = list_sheet[i], index = False)
        except:
            print('Missing report: ' + list_report[i])
        i += 1
    wb.save()

Is there any way to speed it up?

Answer 1

idiom

Let us rename list_report to reports. Then your while loop is usually expressed as simply: for i in range(len(reports)):

You access the i-th element several times. The loop could bind that for you, with: for i, report in enumerate(reports):

But it turns out you need i only to keep list_report and list_sheet in step. So most folks would iterate over the pairs directly: for report, sheet in zip(reports, list_sheet):
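
A sketch of the loop forms side by side may help. The question's loop also reads list_sheet[i], so the last form pairs the two lists with zip. The list contents below are made-up placeholders, not the asker's real report names.

```python
# Hypothetical stand-ins for the question's list_report and list_sheet.
reports = ["sales", "stock", "returns"]
sheets = ["Sales", "Stock", "Returns"]

# Index-based loop, equivalent to the original while loop.
by_index = []
for i in range(len(reports)):
    by_index.append((reports[i], sheets[i]))

# enumerate binds the element, but i is still needed to index sheets.
by_enumerate = []
for i, report in enumerate(reports):
    by_enumerate.append((report, sheets[i]))

# zip walks both lists in step, so no index is needed at all.
by_zip = []
for report, sheet in zip(reports, sheets):
    by_zip.append((report, sheet))

print(by_zip)
```

All three loops visit the same (report, sheet) pairs; the zip form simply has the fewest moving parts.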

code organization

This bit of code is very nice:

        for column in strip_column:
            try:
                df[column] = df[column].str.strip('=("")')
            except:
                pass

I recommend you bury it in a helper function, using def strip_punctuation. (The list should be plural, I think? strip_columns?) Then you would have a simple sequence of df assignments.
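
One possible shape for such a helper, as a sketch rather than the asker's actual code: the signature, the chars parameter, and the sample data are assumptions. It also narrows the bare except to the two errors this step is likely to raise, a missing column (KeyError) or a non-string column (AttributeError from the .str accessor).

```python
import pandas as pd


def strip_punctuation(df, strip_columns, chars='=("")'):
    """Return a copy of df with chars stripped from both ends of the named columns."""
    df = df.copy()
    for column in strip_columns:
        try:
            df[column] = df[column].str.strip(chars)
        except (KeyError, AttributeError):
            # Column is missing or does not hold strings: leave it alone.
            continue
    return df


# Hypothetical sample data in the ="..." style that some CSV exports use.
df = pd.DataFrame({"id": ['="001"', '="002"'], "qty": [1, 2]})
clean = strip_punctuation(df, ["id", "qty", "missing"])
print(clean["id"].tolist())  # ['001', '002']
```

With this in place, the loop body reads as a short chain: read the CSV, strip_punctuation, adjust_report, convert to numeric, write the sheet.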

timing

Profile the elapsed time with time(). Surround each df assignment with code like this:

    from time import time

    t0 = time()
    df = ...
    print(time() - t0)

That will show you which part of your processing pipeline takes the longest and therefore deserves the most effort in speeding it up.
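
If repeating the t0 lines gets tedious, the same measurement can be wrapped in a small context manager. This is a sketch under assumptions: timed and timings are hypothetical names, perf_counter is swapped in for time() because it is intended for measuring intervals, and sleep() stands in for the real pipeline steps.

```python
from contextlib import contextmanager
from time import perf_counter, sleep

# Accumulated seconds per step label, summed across all reports.
timings = {}


@contextmanager
def timed(label):
    """Add the wall-clock seconds spent inside the with block to timings[label]."""
    t0 = perf_counter()
    try:
        yield
    finally:
        timings[label] = timings.get(label, 0.0) + perf_counter() - t0


# sleep() is a placeholder for the real steps (read_csv, adjust_report, ...).
with timed("read_csv"):
    sleep(0.01)
with timed("adjust_report"):
    sleep(0.05)

# Print the slowest step first.
for label, seconds in sorted(timings.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label}: {seconds:.3f}s")
```

Because the totals accumulate per label, running this across every report gives one figure per step instead of a line of output per assignment.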

I suspect adjust_report() uses the bulk of the time, but without seeing it that's hard to say.

Answer 1 posted 2019-03-26 15:11:07
