Read in xlsx in Pandas with openpyxl

Question

From what I've read online, Pandas read_excel function has removed support for xlsx files but it's supposed to be easy to read them in but just using the openpyxl engine.

When I run the following I get an error that says "unexpected keyword argument synchVertical " Here's my code:

pd.read_excel( path.join(data_dir,"opto_data.xlsx"), engine = 'openpyxl' )

And here are the dependencies I have installed...

pandas-1.2.4
openpyxl-3.0.7

I just realized it might be the new version of vs-code that broke it

Answer 1

Try this

X = pd.ExcelFile("filename.xlsx")
df = X.parse("sheet name here")

It works on both engine xlrd as well as openpyxl Also install xlrd for better experience

pip install xlrd

Answer 2

An.xlsx workbook is a zipped archive of xml files. According to the API reference the xml file for a worksheet can contain a 'worksheet properties' element ( <sheetPr> ) with an attribute syncVertical (without an h ). However opening a workbook with syncVertical in Excel causes an error, while synchVertical works fine. Other software seems to have followed Excel in creating workbooks with the 'wrong' spelling, while openpyxl only accepts syncVertical as per the specs.

Hopefully openpyxl will follow other software in accepting the misspelt attribute. In the meantime a fix is to remove the attribute. This can be done manually by opening the workbook in Excel and saving it again, which seems to remove the attribute. Alternatively, we can adapt this answer to edit the zip archive. Unfortunately it is a bit slow as it has to read and write the whole archive just to remove this one attribute. As a hacky quick solution we use a simple find/replace to take out the unwanted property. A better but slower solution would be to parse the xml files properly.

import tempfile
from zipfile import ZipFile
import shutil
import os
from fnmatch import fnmatch

def change_in_zip(file_name, name_filter, change):
    tempdir = tempfile.mkdtemp()
    try:
        tempname = os.path.join(tempdir, 'new.zip')
        with ZipFile(file_name, 'r') as r, ZipFile(tempname, 'w') as w:
            for item in r.infolist():
                data = r.read(item.filename)
                if fnmatch(item.filename, name_filter):
                    data = change(data)
                w.writestr(item, data)
        shutil.move(tempname, file_name)
    finally:
        shutil.rmtree(tempdir)

change_in_zip("opto_data.xlsx", 
              name_filter='xl/worksheets/*.xml', # the problematic property is found in the worksheet xml files
              change=lambda d: d.replace(b' synchVertical="1"', b' '))

Read in xlsx in Pandas with openpyxl

Question

2 answers

solution1
0 2021-06-10 17:51:44

solution2
0 2021-07-12 18:32:24

Read in xlsx in Pandas with openpyxl

Question

2 answers

solution1 0 2021-06-10 17:51:44

solution2 0 2021-07-12 18:32:24

solution1
0 2021-06-10 17:51:44

solution2
0 2021-07-12 18:32:24