
Openpyxl: How to troubleshoot opening a workbook

tl;dr: What are some good first steps for figuring out what is hanging up openpyxl when it is trying to load a workbook?

Long version: So, I've come across a few "why doesn't it work?" style questions on SO for openpyxl but haven't seen much in the way of actual attempts to discover or fix the problem.

I just started checking out openpyxl and it seems pretty promising but, while just starting, I've run into a problem: I have a variety of workbooks that are pretty complex, and I'd like to make a good attempt at reading at least the data from them. The workbook I am using isn't huge (~750kB), but it does have a lot in it: conditional formatting, data validation, named ranges, VBA content, etc. When I try to open the workbook, I get a warning about the data validation (OK, no big deal) but then it cranks on the CPU and accomplishes nothing for a long time. I don't know if it will ever finish because, inevitably, I need to move on, so I quit. Regardless, the loading, if it ever did finish, is way too slow to be useful.

So, I'd love it if somebody could suggest some solid first steps to identifying what the hold-up is so I can try to make this work either by removing the offending content from the workbook or ideally by doing something on the python side to handle things more smoothly.

Just for clarity, here's the two lines of code I started with:

from openpyxl import Workbook, load_workbook
wb = load_workbook('book.xlsm')

As @CharlieClark guessed, the problem with my particular workbook was data validation set for entire columns. In the interest of providing a satisfactory answer to this question, I did a little experimentation anyway, trying to see how I could have deduced this on my own. Since I don't think I could possibly write a how-to that actually covers anybody else's problem, I tried two methods of looking at the problem, based on @BoarGules' and @CharlieClark's suggestions, and wrote them up as examples:

Method 1: Split the workbook into smaller parts, compare loading

If you are trying to figure out what is holding up the process, I'd recommend having a good think about what the workbook contains and what might cause openpyxl to be doing a bunch of extra processing (more on this in Method 2). Rather than simply splitting each sheet into a new workbook and trying to load each (I did this and most of the smaller workbooks would not load - most of my sheets had basically the same structure and the same problem for loading), I'd try thinking about what content you have - data validation, conditional formatting, what-have-you - and removing one content type at a time.

When I removed all of the data validation in the workbook, it suddenly loaded like a snap!
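A suspect like this can also be checked without waiting on openpyxl at all, because an .xlsx/.xlsm file is a zip archive and each worksheet part stores its data validation rules as dataValidation elements whose sqref attribute lists the affected ranges. The following is only a minimal sketch: the helper name list_data_validations is mine (it is not part of openpyxl), it assumes worksheet parts live under xl/worksheets/, and it does not look at rules stored in extension lists, which use a different namespace.

```python
import zipfile
import xml.etree.ElementTree as ET

# Main SpreadsheetML namespace used by worksheet parts in .xlsx/.xlsm files.
NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def list_data_validations(path):
    """Return (sheet_part, sqref) pairs for every data validation rule.

    Hypothetical helper: `path` may be a filename or a file-like object.
    It reads the raw worksheet XML and does not involve openpyxl.
    """
    found = []
    with zipfile.ZipFile(path) as archive:
        for name in sorted(archive.namelist()):
            # Assumption: worksheet parts are xl/worksheets/*.xml
            if not (name.startswith("xl/worksheets/") and name.endswith(".xml")):
                continue
            with archive.open(name) as part:
                # iterparse yields each element once its end tag is reached
                for _event, elem in ET.iterparse(part):
                    if elem.tag == NS + "dataValidation":
                        found.append((name, elem.get("sqref", "")))
                    # drop text and attributes of elements we are done with
                    elem.clear()
    return found
```

Calling list_data_validations('book.xlsm') and printing the pairs shows which sheets carry rules and over which ranges; a range that runs down to row 1048576 covers an entire column and is a likely candidate for removal.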

Method 2: Get more familiar with the source code and try profiling your way into what the problem is

I found this method to be a lot more satisfying as I at least now have a (very vague) idea of how openpyxl loads workbooks but this method does require wading through the source code to think through the problem - if you don't want to do that, stick with Method 1. This method also requires having a good sample workbook that loads OK for comparison. For me, since my first attempt was following @CharlieClark's guess and removing all the data validation, I used the 'fixed' workbook for comparison, which was kind of cheating but oh well.

With my good workbook, I ran a quick profile on the load_workbook function to see what it looks like. I found it most useful to sort results by 'tottime', the total time spent inside each function itself, excluding calls to sub-functions:

>>> import profile
>>> from openpyxl import load_workbook
>>> profile.run('wb = load_workbook("good.xlsm")', sort='tottime')

     4228306 function calls (4186125 primitive calls) in 12.859 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    84364    1.828    0.000    6.234    0.000 worksheet.py:138(parse_cell)
   267315    0.625    0.000    0.625    0.000 :0(get)
   197422    0.594    0.000    0.953    0.000 ElementTree.py:1286(read_events)
      284    0.578    0.002    0.578    0.002 :0(feed)
12986/8974    0.516    0.000    2.125    0.000 serialisable.py:42(from_tree)
   538565    0.500    0.000    0.500    0.000 :0(isinstance)
    89339    0.500    0.000    1.000    0.000 cell.py:43(coordinate_from_string)
24877/9318    0.438    0.000    0.984    0.000 serialisable.py:187(__hash__)
   193137    0.422    0.000    0.422    0.000 :0(match)
     5119    0.422    0.000    6.734    0.001 worksheet.py:259(parse_row)
    73194    0.406    0.000    0.562    0.000 base.py:40(__set__)
   168920    0.375    0.000    0.375    0.000 :0(find)
   197144    0.344    0.000    1.906    0.000 ElementTree.py:1218(iterator)
   251298    0.312    0.000    0.312    0.000 :0(getattr)
    84364    0.312    0.000    0.812    0.000 cell.py:106(__init__)
    84364    0.297    0.000    1.203    0.000 cell.py:181(coordinate_to_tuple)
...

I ran a profile of loading this and a few other workbooks and it looks like the execution time is spent mostly in the worksheet.py module (like above) and in the serialisable.py module, which I think makes sense as that's where much of the reading / processing data happens.
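The standard library also ships cProfile, which offers the same kind of report as profile with much lower overhead. Below is a minimal sketch of wrapping a single call with it so that the report still comes back if the load has to be interrupted with Ctrl+C. The helper name profile_call and its top parameter are my own choices for illustration, not anything from openpyxl.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=15, **kwargs):
    """Profile func(*args, **kwargs) and return (result, report_text).

    Hypothetical helper. If the call is interrupted with Ctrl+C, the
    result is None and the report covers the work done up to that point.
    """
    profiler = cProfile.Profile()
    result = None
    profiler.enable()
    try:
        result = func(*args, **kwargs)
    except KeyboardInterrupt:
        pass  # keep the partial profile instead of losing it
    finally:
        profiler.disable()

    # Render the same table that profile.run() prints, sorted by tottime,
    # but into a string so it can be saved or compared later.
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("tottime").print_stats(top)
    return result, out.getvalue()
```

For the workbooks discussed here, usage would look like wb, report = profile_call(load_workbook, "good.xlsm") followed by print(report).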

For comparison, when I let the bad workbook load for a while then abort, here is what I got for a profile:

>>> import profile
>>> from openpyxl import load_workbook
>>> profile.run('wb = load_workbook("bad.xlsm")', sort='tottime')

     14111962 function calls (14076527 primitive calls) in 27.797 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  3045757    9.797    0.000   20.359    0.000 cell.py:157(rows_from_range)
6091514/6091513    7.062    0.000   10.562    0.000 cell.py:166(<genexpr>)
  3045783    3.500    0.000    3.500    0.000 :0(format)
       19    2.797    0.147   23.156    1.219 :0(extend)
       19    0.469    0.025   23.625    1.243 datavalidation.py:59(expand_cell_ranges)
   366802    0.375    0.000    0.375    0.000 :0(isinstance)
24877/9318    0.344    0.000    0.672    0.000 serialisable.py:187(__hash__)
    15947    0.234    0.000    1.125    0.000 worksheet.py:138(parse_cell)
12686/8831    0.219    0.000   25.172    0.003 serialisable.py:42(from_tree)
   250829    0.188    0.000    0.188    0.000 :0(getattr)
       90    0.172    0.002    0.172    0.002 :0(feed)
    63452    0.172    0.000    0.297    0.000 base.py:40(__set__)
    54974    0.125    0.000    0.125    0.000 :0(get)
    44462    0.125    0.000    0.125    0.000 :0(match)
    18172    0.109    0.000    0.203    0.000 cell.py:43(coordinate_from_string)
    56006    0.094    0.000    0.156    0.000 ElementTree.py:1286(read_events)
    83605    0.078    0.000    0.078    0.000 base.py:25(__set__)
    17709    0.078    0.000    0.141    0.000 sequence.py:24(__set__)
...

So looking at this profile, you can see that most of the execution time is being spent processing cell addresses (rows_from_range) rather than reading the actual data, as in the first profile. I assume that this is not wanted. The fifth line of the profile table also shows a lot of time spent in or under (cumtime column) the datavalidation.py function expand_cell_ranges, even though it was called only 19 times and didn't show up anywhere near the top of the other profile. When I dug through the source code, I saw that expand_cell_ranges calls rows_from_range in a loop! From there, I think we can reasonably conclude that, in this case, something about the data validation is causing openpyxl to process a whole whack of cell addresses that have nothing useful in them. Since I already know that my workbook had data validation set for entire columns of empty cells, I'd count this as a pretty solid confirmation of the diagnosis.

If anybody reading this needs to try to reverse-engineer their way into diagnosing a workbook that won't load, I'd compare the first profile above to the profile for loading their problem workbook and see what changed. That should at least provide a good starting point for guessing why it changed.
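The comparison can be done by eye, as above, or roughly automated. The sketch below collects tottime per function from two cProfile runs and lists the functions whose time grew the most. All three helper names are hypothetical, and it relies on the stats dictionary that pstats.Stats exposes, keyed by (filename, line number, function name).

```python
import cProfile
import pstats
from collections import defaultdict


def run_profiled(func, *args, **kwargs):
    """Run func under cProfile and return the profiler, even after Ctrl+C."""
    profiler = cProfile.Profile()
    try:
        profiler.runcall(func, *args, **kwargs)
    except KeyboardInterrupt:
        pass  # a partial profile of the bad workbook is still useful
    return profiler


def tottime_by_function(profiler):
    """Sum tottime per (file name, function name) from a finished profiler."""
    totals = defaultdict(float)
    # Stats.stats maps (filename, lineno, funcname) ->
    # (primitive calls, total calls, tottime, cumtime, callers)
    for (filename, _lineno, funcname), entry in pstats.Stats(profiler).stats.items():
        short = filename.replace("\\", "/").rsplit("/", 1)[-1]
        totals[(short, funcname)] += entry[2]
    return dict(totals)


def biggest_increases(good, bad, top=10):
    """List functions whose tottime grew most from the good run to the bad run."""
    delta = {key: t - good.get(key, 0.0) for key, t in bad.items()}
    return sorted(delta.items(), key=lambda item: item[1], reverse=True)[:top]
```

With the two workbooks from this answer, that would be good = tottime_by_function(run_profiled(load_workbook, "good.xlsm")), the same for "bad.xlsm", and then biggest_increases(good, bad). Functions that are absent from the good run count as starting from zero, which is how something like rows_from_range would rise to the top.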
