Processing very large files with openpyxl in Python

Question

I have a spreadsheet with 11,000 rows and 10 columns. I am trying to copy selected columns from each row, add additional information per line, and write the result to a txt file.

Unfortunately, I am having really bad performance issues: processing starts to bog down after about 100 rows and maxes out my processor. Is there a way to speed this up or use a better methodology? I am already using read_only=True and data_only=True.

The most memory-intensive part is iterating through each cell:

for i in range(probeStart, lastRow+1):
    dataRow =""
    for j in range (1,col+2):
        dataRow = dataRow + str(sheet.cell(row=i, column=j).value)  + "\t"

    sigP = db.get(str(sheet.cell(row= i, column=1).value), "notfound") #my additional information 
    a = str(sheet.cell(row = i, column = max_column-1).value) +"\t" 
    b  = str(sheet.cell(row = i, column = max_column).value) + "\t"
    string1  = dataRow + a + b + sigP + "\n"
    w.write(string1)

Answer 1

Question: Is there a way to speed this up or use better methodology?

Try the following to see if it improves performance:

Note: I didn't know the values of col and max_column!
My example uses 4 columns and skips column C.

Data:
['A1', 'B1', 'C1', 'D1'],
['A2', 'B2', 'C2', 'D2']

from openpyxl.utils import range_boundaries

# ws is the worksheet and db the lookup dict from the question
min_col, min_row, max_col, max_row = range_boundaries('A1:D2')

for row_cells in ws.iter_rows(min_col=min_col, min_row=min_row,
                              max_col=max_col, max_row=max_row):

    # Slice column values up to B
    data = [cell.value for cell in row_cells[:2]]

    # Extend the list with column values from D to the end
    data.extend([cell.value for cell in row_cells[3:]])

    # Append db.get(column A value)
    data.append(db.get(row_cells[0].value, "notfound"))

    # Join all values with \t; str() handles numbers and empty cells
    line = '\t'.join(str(value) for value in data)
    print(line)

    # Write to the output file
    #w.write(line + '\n')

Output:
A1 B1 D1 notfound
A2 B2 D2 notfound

Tested with Python: 3.4.2 - openpyxl: 2.4.1
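
A follow-up sketch, not part of the original answer: in read-only mode, repeated sheet.cell(row=..., column=...) lookups are commonly reported to be slow, so it may help to make a single pass over the rows and write each line as soon as it is built. The helper names format_row and write_rows, the skip parameter, and the sample db contents are all hypothetical. The rows here are stand-in tuples so the snippet runs without a workbook; with openpyxl they could come from iter_rows(values_only=True), which assumes a newer release than the 2.4.1 tested above.

```python
import io


def format_row(values, db, skip=(2,)):
    """Build one tab-separated output line from a tuple of cell values."""
    # Keep every column whose 0-based index is not in `skip` (2 = column C)
    kept = [v for i, v in enumerate(values) if i not in skip]
    # Look up the extra information by the first column's value
    kept.append(db.get(str(values[0]), "notfound"))
    # str() guards against numbers and empty cells (None)
    return "\t".join(str(v) for v in kept) + "\n"


def write_rows(rows, out, db, skip=(2,)):
    """Stream rows to `out` one line at a time; return the number written."""
    count = 0
    for values in rows:
        out.write(format_row(values, db, skip))
        count += 1
    return count


# Stand-in data. With openpyxl the rows could instead come from, e.g.:
#   wb = load_workbook("big.xlsx", read_only=True, data_only=True)
#   rows = wb.active.iter_rows(values_only=True)   # assumed: openpyxl 2.6+
rows = [("A1", "B1", "C1", "D1"), ("A2", "B2", None, 4)]
db = {"A2": "sig-A2"}  # hypothetical lookup table

buf = io.StringIO()  # stands in for the open output file `w`
written = write_rows(rows, buf, db)
print(buf.getvalue(), end="")
```

Because write_rows accepts any iterable of tuples, the formatting logic can be checked on small in-memory data before pointing it at the 11,000-row sheet.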

Solution 1: accepted, 2017-08-02 16:06:42
