Python OpenPyxl trouble detecting all merged cells

Question

I am trying to detect all the merged cells in an openpyxl.worksheet.worksheet.Worksheet object, but merged_cells.ranges does not seem to return all of them, only the merged cells in some columns. My goal is to detect the merged cells, unmerge them, and then re-merge certain cells based on column values. When unmerging, I fill the unmerged cells with the top-left value of the merged range.

I worked around this problem by filling the NaN cells that should have been recognized as part of a merged range with the previous value in the column, since all of my merged ranges lie within a single column (for example A18:A19, B18:B19). But things became trickier after I updated my xlsx file. With my previous xlsx, openpyxl did not find the merged cells in columns A, C and E. Now it has trouble finding the merged cells in columns B, D and F. The two xlsx files have the same format but different data.

Here is an example of what my xlsx looks like: xlsx sample

My code to read the xlsx then detect & unmerge the merged cells:

from openpyxl import load_workbook
from openpyxl.utils import range_boundaries

# path_client_info is the path to the xlsx file (defined elsewhere)
client_info_wb = load_workbook(path_client_info)
sheet_name = client_info_wb.sheetnames[0]
client_info_ws = client_info_wb[sheet_name]

for cell_group in client_info_ws.merged_cells.ranges:
    print(cell_group)
    min_col, min_row, max_col, max_row = range_boundaries(str(cell_group))
    top_left_cell_value = client_info_ws.cell(row=min_row, column=min_col).value
    print(top_left_cell_value)
    client_info_ws.unmerge_cells(str(cell_group))
    # Fill every cell of the former merged range with its top-left value
    for row in client_info_ws.iter_rows(min_col=min_col, min_row=min_row, max_col=max_col, max_row=max_row):
        for cell in row:
            cell.value = top_left_cell_value

Output of the two print calls (each merged range, followed by its top-left value):

A48:A49
2021-01-05
C48:C49
XX5614
E48:E49
ID
A46:A47
2021-01-05
C46:C47
XX2134
E46:E47
ID
A44:A45
2021-01-05
C44:C45
XX1234
E44:E45
ID

The information in the columns where merged_cells.ranges fails to detect merged cells is needed by later operations in my code. Can anyone help me with this? Has anyone else run into the same issue? I have spent a long time looking for a pattern in my xlsx that could explain the problem, with no luck.

Answer 1

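    while sheet.merged_cells: # <- Here's the change to make.
        # One pass can miss ranges because unmerge_cells() modifies merged_cells.
        for cell_group in sheet.merged_cells:
            # Value of the top-left cell (note: converted to a stripped string).
            val = str(cell_group.start_cell.value).strip()
            sheet.unmerge_cells(str(cell_group))
            # Write that value into every cell the range used to cover.
            for merged_cell in cell_group.cells:
                sheet.cell(row=merged_cell[0], column=merged_cell[1]).value = val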

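It seems that the collection of merged ranges changes while it is being iterated over (each unmerge_cells call removes the current range from it), so a single pass skips some ranges. Repeating the loop until merged_cells is empty does the trick.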
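To illustrate why a single pass can skip ranges, here is a minimal sketch that uses a plain Python list as a stand-in for the merged ranges. That stand-in is an assumption (how openpyxl stores the ranges may differ between versions), and the function names are hypothetical, but the skipping pattern it produces matches the A/C/E output shown in the question.

```python
def visit_while_removing(ranges):
    """Remove each item from the list being iterated; every other item is skipped."""
    ranges = list(ranges)  # work on a copy so the caller's data is untouched
    visited = []
    for item in ranges:
        visited.append(item)
        ranges.remove(item)  # shifts the next item into the current slot
    return visited, ranges


def visit_snapshot(ranges):
    """Iterate over a snapshot, so removals cannot disturb the loop."""
    ranges = list(ranges)
    visited = []
    for item in list(ranges):  # snapshot taken before any removal
        visited.append(item)
        ranges.remove(item)
    return visited, ranges


# Plain strings standing in for merged ranges (illustrative data only)
sample = ["A48:A49", "B48:B49", "C48:C49", "D48:D49", "E48:E49", "F48:F49"]
print(visit_while_removing(sample))
print(visit_snapshot(sample))
```

The first function visits only the A, C and E entries and leaves B, D and F behind. The second visits all six in one pass.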

There's also something weird going on with the in-memory buffer, so I save the file to disk and reload it with pandas, rather than loading the dataframe from the sheet in memory. (This could easily be optimized with a BytesIO object.)
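The BytesIO idea mentioned above could look roughly like the following sketch. The helper name workbook_to_buffer is hypothetical, and this only shows the round trip through an in-memory buffer instead of a file on disk. I have not verified that it avoids the buffer oddity described above.

```python
from io import BytesIO

from openpyxl import Workbook, load_workbook


def workbook_to_buffer(workbook):
    """Save the workbook into an in-memory buffer and rewind it for reading."""
    buffer = BytesIO()
    workbook.save(buffer)  # save() accepts a file-like object as well as a path
    buffer.seek(0)         # rewind so the next reader starts at the beginning
    return buffer


if __name__ == "__main__":
    wb = Workbook()
    wb.active["A1"] = "example"
    buf = workbook_to_buffer(wb)
    # Reload from memory to confirm the data survived the round trip
    print(load_workbook(buf).active["A1"].value)
```

The returned buffer can be passed to anything that reads an xlsx from a file-like object, for example pandas.read_excel(buffer).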

For me, this guarantees that all merged cells are unmerged and replaced with the start cell's value.
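An alternative to repeating the loop is to iterate over a snapshot of the merged ranges, so that unmerge_cells cannot disturb the iteration. The sketch below is one way to do that, not necessarily the approach the library authors recommend. The function name unmerge_and_fill and the demo data are made up for illustration. Unlike the code above, it keeps the original cell value rather than converting it to a string.

```python
from openpyxl import Workbook


def unmerge_and_fill(sheet):
    """Unmerge every merged range and copy its top-left value into each cell."""
    # list(...) takes a snapshot; unmerge_cells() then modifies the original only
    for cell_range in list(sheet.merged_cells.ranges):
        top_left_value = sheet.cell(row=cell_range.min_row,
                                    column=cell_range.min_col).value
        sheet.unmerge_cells(cell_range.coord)
        for row in sheet.iter_rows(min_row=cell_range.min_row,
                                   max_row=cell_range.max_row,
                                   min_col=cell_range.min_col,
                                   max_col=cell_range.max_col):
            for cell in row:
                cell.value = top_left_value


if __name__ == "__main__":
    # Illustrative workbook: six two-row merges in columns A to F
    wb = Workbook()
    ws = wb.active
    for col in "ABCDEF":
        ws[f"{col}1"] = f"value-{col}"
        ws.merge_cells(f"{col}1:{col}2")
    unmerge_and_fill(ws)
    print(len(ws.merged_cells.ranges), ws["B2"].value)
```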
