Python: deleting Excel rows while iterating. Alternative to openpyxl, or a fix for ws.max_row returning the wrong value?

I'm working with Python on Excel files, and so far I have been using openpyxl. I need to iterate over the rows and delete those that do not meet specific criteria. Let's say I was using something like:

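current_row = 1
while current_row <= ws.max_row:
    # Column L holds the text to check; empty cells have the value None
    value = ws[f'L{current_row}'].value
    if isinstance(value, str) and 'something' in value:
        ws.delete_rows(current_row)
        # Do not advance: the next row has shifted into current_row
        continue
    current_row += 1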

Everything was fine until I ran into a problem with ws.max_row. In a new Excel file that I received for processing, ws.max_row returned more rows than the sheet actually contains. After some googling I found out why this happens. Here is a great explanation of the problem, which I found in a comment on Stack Overflow:

"However, ws.max_row will not check if last rows are empty or not. If cell's content at the end of the worksheet is deleted using Del key or by removing duplicates, remaining empty rows at the end of your data will still count as a used row. If you do not want to keep these empty rows, you will have to delete those entire rows by selecting rows number on the left of your spreadsheet and deleting them (right click on selected row number(s) -> Delete)" (V. Brunelle)

Thanks, V. Brunelle, for the very good explanation of the cause of the problem.

In my case it happens because some of the rows were removed by deleting duplicates. For example, my file contains 400 rows listed one after another (without any gaps), but ws.max_row returns 500.

For now I'm using a quick fix:

while current_row <= len([row for row in data_ws.iter_rows(min_row=min_row) if not all(cell.value is None for cell in row)]):

But I know that this is very inefficient, because the whole sheet is scanned again on every pass of the loop. That's why I'm asking this question: I'm looking for a better solution. From what I've found on Stack Overflow, I could:

  1. Create a copy of the worksheet, iterate over the copy, and call ws.delete_rows on the original worksheet, so that I need to apply my fix only once.
  2. Iterate backwards with a for loop, so that I don't have to deal with ws.max_row, since for loops work fine in this case (they read the proper file dimensions). This method seems promising, but I always have 4 rows at the top of the workbook that I don't touch at all, and any debugging would have to be done backwards as well, which might not be very enjoyable :D.
  3. Use another Python library to process Excel files. I don't know which one would be better, though, because keeping the workbook styles (and changing them if needed) is very important to me. I've read some promising things about the pywin32 library (win32com.client), but it seems to lack documentation, it might be hard to work with, and I don't know how it performs. I also considered pandas, but, to put it kindly, it messes up the styles (in reality it drops all styles in the worksheet).
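
Option 2 could be sketched roughly as below. This is only an illustration under stated assumptions, not a tested solution for the real file: the function name delete_matching_rows_backwards and its parameters are made up here, first_data_row=5 assumes the four untouched rows sit at the very top of the sheet, and the isinstance check assumes the column of interest holds text. It relies only on ws.max_row, ws.cell and ws.delete_rows from openpyxl.

```python
def delete_matching_rows_backwards(ws, column, needle, first_data_row=5):
    """Delete rows whose cell in `column` contains `needle`, scanning bottom-up.

    `ws` is expected to behave like an openpyxl worksheet.
    `first_data_row=5` is an assumption: it leaves rows 1-4 untouched.
    Returns the number of rows deleted.
    """
    deleted = 0
    # range() is evaluated once, so an inflated max_row only costs a few
    # extra iterations over empty rows at the bottom of the sheet.
    for row_idx in range(ws.max_row, first_data_row - 1, -1):
        value = ws.cell(row=row_idx, column=column).value
        # Empty cells hold None, so only test real strings.
        if isinstance(value, str) and needle in value:
            # Rows below row_idx shift up, but they were already checked.
            ws.delete_rows(row_idx)
            deleted += 1
    return deleted
```

With the layout from the question this might be called as delete_matching_rows_backwards(ws, column=12, needle='something'), since L is the 12th column. Because deletions only move rows that have already been visited, no index bookkeeping is needed.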

I'm stuck now, because I really don't know which route I should choose.

I would appreciate any advice or opinion on the topic, and if possible I would like to have a small discussion here.

Best regards!

If max_row doesn't report what you expect, you'll need to work around the issue as best you can. That might mean deleting the rows manually ("delete those entire rows by selecting rows number on the left of your spreadsheet and deleting them (right click on selected row number(s) -> Delete)"), or determining the last row in your code in some other way and then perhaps programmatically deleting all the rows from there to max_row, so that at least it reports correctly on the next run.
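
As a rough sketch of the programmatic route, the helpers below find the last row that really holds data and then drop everything after it. The names last_data_row and trim_trailing_empty_rows are hypothetical, and treating an empty string the same as None is an assumption that may not suit every sheet. Only ws.max_row, ws.max_column, ws.cell and ws.delete_rows from openpyxl are used.

```python
def last_data_row(ws):
    """Return the index of the last row with a non-empty cell (0 if none)."""
    max_col = ws.max_column
    # Walk up from the reported last row; this is a single pass,
    # unlike re-counting all rows on every loop iteration.
    for row_idx in range(ws.max_row, 0, -1):
        for col_idx in range(1, max_col + 1):
            value = ws.cell(row=row_idx, column=col_idx).value
            # Assumption: "" counts as empty, just like None.
            if value is not None and value != "":
                return row_idx
    return 0


def trim_trailing_empty_rows(ws):
    """Delete the empty rows after the data and return the last data row."""
    last = last_data_row(ws)
    extra = ws.max_row - last
    if extra > 0:
        ws.delete_rows(last + 1, amount=extra)
    return last
```

Calling trim_trailing_empty_rows(data_ws) once before the main loop, and saving the workbook afterwards, should let ws.max_row be used as the loop bound again.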

You could also incorporate your fix code into your example code for deleting rows that meet specific criteria.

For example, a test sheet has 9 rows of data, but cell B15 contains an empty string, so max_row returns 15 rather than 9.
The example code checks every used cell in the row for a value of None and only processes the 9 rows with data.

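from openpyxl import load_workbook

filename = "foo.xlsx"

wb = load_workbook(filename)
data_ws = wb['Sheet1']

# max_row also counts trailing rows that only hold emptied cells
print(f"Max Rows Reports {data_ws.max_row}")

for row in data_ws:
    print(f"Checking row {row[0].row}")
    # Treat a row as data only if every cell in it has a value
    if all(cell.value is not None for cell in row):
        if 'something' in data_ws[f'L{row[0].row}'].value:
            # Caution: the next row shifts up after a deletion,
            # so this for loop does not check it
            data_ws.delete_rows(row[0].row)
    else:
        # Stop at the first row that has an empty cell
        print(f"Actual Max Rows is {row[0].row}")
        break

wb.save('out_' + filename)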

Output

Max Rows Reports 15
Checking row 1
Checking row 2
Checking row 3
Checking row 4
Checking row 5
Checking row 6
Checking row 7
Checking row 8
Checking row 9
Actual Max Rows is 9

Of course this is not perfect: if any of the 9 rows with data had a cell with a value of None, the loop would stop at that point. However, if you know that's not going to be the case, it may be all you need.
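
If rows with some empty cells are possible, one variation is to treat a row as data when any of its cells has a value, work out the real last row once, and then run the usual forward loop against that bound. The sketch below is an assumption-laden illustration rather than the method used above: row_has_data and delete_rows_containing are made-up names, an empty string is treated as empty, and only string cells are matched.

```python
def row_has_data(ws, row_idx, max_col):
    """True if at least one cell in the row is neither None nor ''."""
    return any(
        ws.cell(row=row_idx, column=col).value not in (None, "")
        for col in range(1, max_col + 1)
    )


def delete_rows_containing(ws, column, needle, first_row=1):
    """Forward scan that deletes rows whose `column` cell contains `needle`.

    Returns (number of rows deleted, index of the last data row afterwards).
    """
    max_col = ws.max_column
    # Determine the real last row once instead of trusting max_row.
    last = ws.max_row
    while last >= first_row and not row_has_data(ws, last, max_col):
        last -= 1

    deleted = 0
    current = first_row
    while current <= last:
        value = ws.cell(row=current, column=column).value
        if isinstance(value, str) and needle in value:
            ws.delete_rows(current)
            last -= 1      # everything below moved up by one row
            deleted += 1
            continue       # re-check the row that shifted into `current`
        current += 1
    return deleted, last
```

A row with a few None cells in the middle of the data no longer ends the loop, and a row that shifts up after a deletion is checked as well.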
