
Deleting rows from a large file using openpyxl

I'm working with openpyxl on an .xlsx file that has around 10K products, some of which are "regular items" and some of which are products that need to be ordered when required. For the project I'm doing, I would like to delete all of the rows containing the items that need to be ordered.

I tested this on a small sample of the actual workbook and the code worked the way I wanted. However, when I tried it on the actual workbook with 10K rows, deleting those rows seems to take forever (it has been running for nearly an hour now).

Here's the code that I used:

import openpyxl

wb = openpyxl.load_workbook('prod.xlsx')
sheet = wb.get_sheet_by_name('Sheet1')
def clean_workbook():
    for row in sheet:
        for cell in row:
            if cell.value == 'ordered':
                sheet.delete_rows(cell.row)

Is there a faster way of doing this with some tweaks to my code? Or is there a better way to read just the regular stock from the workbook without deleting the unwanted items?

You can open the workbook in read-only mode and import all of its content into a list; modifying a list is much faster than working in Excel. After you modify the list, create a new worksheet and write the list back to Excel. I did it this way with my Excel file of 100k items.
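A minimal sketch of this read, filter, and rewrite approach, assuming the marker value is the string 'ordered' from the question and that it may appear in any cell of a row. The function names are hypothetical, and only cell values are copied, not formatting.

```python
def keep_regular_rows(rows, marker="ordered"):
    """Return only the rows in which no cell equals the marker value."""
    return [row for row in rows if marker not in row]


def rewrite_without_marked_rows(src_path, dst_path, marker="ordered"):
    """Read the active sheet into memory, filter it, and save a new file."""
    # Imported here so keep_regular_rows() can be used without openpyxl.
    import openpyxl

    # Read-only mode streams the rows instead of building every cell object.
    src = openpyxl.load_workbook(src_path, read_only=True)
    rows = list(src.active.iter_rows(values_only=True))
    src.close()

    # Values only: formatting from the source workbook is not carried over.
    dst = openpyxl.Workbook()
    out = dst.active
    for row in keep_regular_rows(rows, marker):
        out.append(row)
    dst.save(dst_path)
```

For example, calling rewrite_without_marked_rows('prod.xlsx', 'prod_regular.xlsx') would leave the original file untouched and write the filtered rows to a second file; the output file name here is only an illustration.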

Deleting rows in a loop can be slow because openpyxl has to update all the cells below the row being deleted. Therefore, you should do this as little as possible. One way is to collect a list of row numbers, check for contiguous groups, and then delete using this list from the bottom.
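The grouping idea could be sketched roughly as follows. The helper names are hypothetical, the marker is assumed to be the string 'ordered' from the question, and sheet is assumed to be a normal (not read-only) openpyxl worksheet. Each contiguous run is removed with a single delete_rows(idx, amount) call.

```python
def group_contiguous(row_numbers):
    """Collapse row numbers into (first_row, count) runs, bottom-most run first."""
    groups = []
    for number in sorted(set(row_numbers)):
        if groups and number == groups[-1][0] + groups[-1][1]:
            # This row directly follows the current run, so extend it.
            groups[-1][1] += 1
        else:
            groups.append([number, 1])
    return [tuple(group) for group in reversed(groups)]


def delete_marked_rows(sheet, marker="ordered"):
    """Delete every row containing the marker, one call per contiguous run."""
    marked = [
        index
        for index, row in enumerate(sheet.iter_rows(values_only=True), start=1)
        if marker in row
    ]
    # Working from the bottom keeps the remaining row numbers valid.
    for first_row, count in group_contiguous(marked):
        sheet.delete_rows(first_row, count)
```

How much this helps depends on how the marked rows are distributed: long runs collapse into a few calls, while scattered single rows still cost one call each.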

A better approach might be to loop through ws.values and write to a new worksheet, filtering out the relevant rows. Copy any other relevant data, such as formatting. Then you can delete the original worksheet and rename the new one.

ws1 = wb['My Sheet']
ws2 = wb.create_sheet('My Sheet New')

for row in ws1.values:
    if row[x] == "ordered":  # x: index of the column to check; we can assume this is always the same column
        continue
    ws2.append(row)

del wb["My Sheet"]
ws2.title = "My Sheet"

For more sophisticated filtering, you will probably want to load the values into a pandas DataFrame, make the changes, and then write to a new sheet.


 