
Increase the speed of Excel file operations (using openpyxl): check a value and delete rows if a condition is met

I have a medium-sized Excel file with about 25,000 rows.

For each row I check whether the value in a specific column is in a list, and if it is, I delete the row.

I'm using openpyxl.

The code:

    count = 1
    while count <= ws.max_row:
        if ws.cell(row=count, column=2).value in remove_list:
            ws.delete_rows(count, 1)
        else:
            count += 1
    wb.save(src)

The code works, but it is very slow (it takes hours) to finish.

I know there are read-only and write-only modes, but in my case I need both: first checking, then deleting.

I see you are using a list of rows which you need to delete. Instead, you can create "sequences" of rows to delete, changing a delete list like [2,3,4,5,6,7,8,45,46,47,48] into one like [[2, 7], [45, 4]],

i.e. delete 7 rows starting at row 2, then delete 4 rows starting at row 45.

Deleting in bulk is faster than deleting one row at a time. I deleted 6k rows in around 10 seconds.

The following code converts a list of row numbers into a list of [first_row, count] sequences:

def get_sequences(list_of_ints):
    """Group row numbers into [first_row, count] runs of consecutive rows.

    Assumes list_of_ints is sorted in ascending order without duplicates.
    """
    sequences = []
    for row in list_of_ints:
        # The current run ends just before first_row + count
        if sequences and row == sequences[-1][0] + sequences[-1][1]:
            sequences[-1][1] += 1  # row extends the current run
        else:
            sequences.append([row, 1])  # row starts a new run

    return sequences

Then run another loop to delete:

    for sequence in sequences:
        sheet.delete_rows(sequence[0], sequence[1])

Personally, I would do two things:

First, transform the list into a set so that each lookup takes less time:

remove_set = set(remove_list)
...
if ws.cell(row=count, column=2).value in remove_set:

Second, I would avoid removing the rows in place, as it takes a lot of time to reorganise the data structures representing the sheet.

I would create a new blank worksheet and add to it only the rows which must be kept.

Then save the new worksheet, overwriting the original if you wish.
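As a rough sketch of this second suggestion, assuming the worksheet objects come from openpyxl and that the value to test sits in column 2 as in the question, the kept rows could be copied like this. The function name copy_rows_to_keep and its parameters are made up for illustration, and only cell values are copied, not formatting.

```python
def copy_rows_to_keep(source_ws, target_ws, remove_values, column=2):
    """Append every row of source_ws whose `column` value is not in remove_values.

    Hypothetical helper: source_ws and target_ws are assumed to be openpyxl
    worksheets. Returns the number of rows copied.
    """
    remove_set = set(remove_values)  # set membership is O(1) on average
    kept = 0
    # values_only=True yields plain tuples of cell values instead of Cell objects
    for row in source_ws.iter_rows(values_only=True):
        if row[column - 1] in remove_set:  # columns are 1-based, tuples are 0-based
            continue
        target_ws.append(row)
        kept += 1
    return kept
```

With openpyxl this might be called with the original sheet and a sheet returned by wb.create_sheet(...), followed by wb.save(...).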

If it still takes too long, consider using the CSV format, so you can treat the input data as text and output it the same way, re-importing the data later into the spreadsheet program (e.g. MS Excel).

Have a look at the official docs and at this tutorial to find out how to use the csv library.
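A minimal sketch of the CSV route, using only Python's standard csv module. The helper name filter_csv_rows and its parameters are hypothetical, and the column to test is assumed to be the second one, as in the question.

```python
import csv


def filter_csv_rows(in_file, out_file, remove_values, column=2):
    """Copy CSV rows from in_file to out_file, skipping rows to be removed.

    Hypothetical helper: in_file and out_file are open text-mode file objects.
    CSV fields are read as strings, so remove_values is compared as strings.
    Returns the number of rows written.
    """
    remove_set = {str(value) for value in remove_values}
    writer = csv.writer(out_file)
    kept = 0
    for row in csv.reader(in_file):
        # Guard against short rows; column is 1-based like a spreadsheet column
        if len(row) >= column and row[column - 1] in remove_set:
            continue
        writer.writerow(row)
        kept += 1
    return kept
```

When working with real files, the csv documentation recommends opening them with newline="".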

Further note: as spotted by @Charlie Clark, the calculation of

ws.max_row

may take some time as well, and there is no need to repeat it on every iteration.

To avoid that, the easiest solution is to work backwards from the last row to the first, so that the deleted rows do not affect the position of the ones before them.
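The backwards loop could look roughly like the sketch below. It assumes ws is an openpyxl worksheet and that the value to test is in column 2, as in the question; the function name is invented for illustration. It still deletes one row at a time, so it only addresses the repeated ws.max_row calls and the index bookkeeping.

```python
def delete_matching_rows_backwards(ws, remove_values, column=2):
    """Delete rows whose `column` value is in remove_values, scanning bottom-up.

    Hypothetical helper: ws is assumed to be an openpyxl worksheet.
    Returns the number of rows deleted.
    """
    remove_set = set(remove_values)
    deleted = 0
    # range() reads ws.max_row once; deleting row r only shifts the rows below r,
    # which have already been visited, so no index adjustment is needed
    for row in range(ws.max_row, 0, -1):
        if ws.cell(row=row, column=column).value in remove_set:
            ws.delete_rows(row, 1)
            deleted += 1
    return deleted
```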

When a number of rows have to be deleted from a sheet, I create a list of these row numbers, e.g. remove_list, and then I rewrite the sheet to a temporary sheet, excluding these rows. I delete the original sheet and give the temporary sheet the original sheet's title. See my function for doing this below:

def delete_excel_rows_with_openpyxl(workbook, sheet, remove_list):
    """ Delete rows with row numbers in remove_list from sheet contained in workbook """

    rows_to_remove = set(remove_list)  # fast lookups; the caller's list is left untouched
    temp_sheet = workbook.create_sheet('TempSheet')

    destination_row_counter = 1
    # start=1 because sheet rows are numbered from 1
    for source_row_counter, source_row in enumerate(sheet.iter_rows(), start=1):
        if source_row_counter in rows_to_remove:
            continue  # do not copy row

        # copy row (cell values only, formatting is not carried over)
        for column_count, cell in enumerate(source_row, start=1):
            temp_sheet.cell(row=destination_row_counter, column=column_count).value = cell.value

        destination_row_counter += 1

    sheet_title = sheet.title
    workbook.remove(sheet)  # replaces the deprecated workbook.remove_sheet(sheet)
    temp_sheet.title = sheet_title

    return workbook, temp_sheet

Adding on to ketdaddy's response: I tested it and noticed that when you use this sequence in a for loop as suggested, you need to update the row number on every iteration to account for the rows already deleted.

For example, when you get to the second step in the loop, the start row is not the original start row; it is the original start row minus the number of rows which were previously deleted.

This code updates ketdaddy's sequence to generate one which takes this into account:

original_sequence = get_sequences(deleterows)
updated_sequence = []
cumdelete = 0  # rows deleted so far by the earlier sequences
for start, delete in original_sequence:
    new_start = start - cumdelete  # shift up by the rows already removed
    cumdelete = cumdelete + delete
    updated_sequence.append([new_start, delete])

updated_sequence
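An alternative to recomputing the start rows, offered here as an assumption rather than something taken from the answers above, is to delete the sequences starting with the one lowest on the sheet, so that the original row numbers stay valid. The helper name delete_runs_bottom_up is hypothetical and ws is assumed to be an openpyxl worksheet.

```python
def delete_runs_bottom_up(ws, sequences):
    """Delete [first_row, count] runs from ws, lowest run on the sheet first.

    Hypothetical helper: sequences uses the original (unshifted) row numbers,
    e.g. the output of get_sequences().
    """
    # Sorting by first_row in descending order means each deletion only shifts
    # rows that lie below the run, so the remaining runs keep their row numbers
    for first_row, count in sorted(sequences, reverse=True):
        ws.delete_rows(first_row, count)
```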
