简体   繁体   中英

How to use Python to read Excel files that contain extended fonts? (openpyxl error: Max value is 14)

As a learning project for Python, I am attempting to read all Excel files in a directory and extract the names of all the sheets.

I have been trying several available Python modules to do this ( pandas in this example), but am running into an issue with most of them depending on openpyxl .

This is my current code:

import os
import pandas

directory_root = 'D:\\testFiles'

# Dict to hold all files, stats
all_files = {}

for _current_path, _dirs_in_path, _files_in_path in os.walk(directory_root):

    # Add all files to this `all_files`
    for _file in _files_in_path:
        # Extract filesystem stats from the file
        _stats = os.stat(os.path.join(_current_path, _file))

        # Add the full file path and its stats to the `all_files` dict.
        all_files[os.path.join(_current_path, _file)] = _stats

# Loop through all found files to extract the sheet names
for _file in all_files:

    # Open the workbook
    xls = pandas.ExcelFile(_file)

    # Loop through all sheets in the workbook
    for _sheet in xls.sheet_names():
        print(_sheet)

This raises an error from openpyxl when calling pandas.ExcelFile() : ValueError: Max value is 14 .

From what I can find online, this is because the file contains a font family above 14. How do I read from an Excel (xlsx) file while disregarding any existing formatting?

The only potential solution I could find suggests modifying the original file and removing the formatting, but this is not an option as I do not want to modify the files in any way.

Is there another way to do this that doesn't have this formatting limitation?

It is easy to detect when family value is out of range by a simple unzip|find in windows or a grep in others. So you could filter out files based on those values. here we see in the bad boy example they are acceptable 2 and an unacceptable 34

在此处输入图像描述

However since all platforms (win 10 included) have TAR it is easiest to first expand the file.xlsx as a set and using find by file in the native OS (or python) then ensure you know exactly which file needs adjusting.

在此处输入图像描述

So we now know it is styles.xml (that's not surprising as font values should be there)

and at this point we can use string replace to change that entry to say

      <family val="3"/>

if that's more useful for your purpose.

Then repack the adjusted xlsx (NOTE:- it is best to only use a tool to "update" the one style.xls file to maintain the zip relative order) and it should behave just the same as a standard.xlsx that has the standard 1-14 fonts, pre-suming the author did not introduce other errors.

This is most probably not because of a font size or family, because it gives ValueError. What I see from this page and this page , it seems that one of your float value in excel file must not be more than 14. That's why it gives the error ValueError: Max value is 14 . You may dive into the file and search for a value which is more than 14 and try your code by manipulating the value.

The issue is that your file does not conform to the Open Office specification. Only certain font families are allowed. Once openpyxl encounters a font out of specification, it throws this error because OpenPyxl only allows spec-conforming excel files .

Some Excel readers may not have an issue with this and are more flexible with non-OpenOffice-spec-conforming files, but openpyxl only implements the Apache Open Office spec.

The xml being parsed will contain information about the font like this:

<font>
  <b/>
  <sz val="11"/>
  <color rgb="FF000000"/>
  <name val="Century Gothic"/>
  <family val="34"/>
</font>

If the family value is over 14, openpyxl throws this ValueError . There is an underlying descriptor in Open Office that controls this.

When other readers like, say, Microsoft Office 365 Excel encounters this, it will change the font family when loading the file to a compliant font (the default, Calibri).

As a workaround, if you don't want to change the value (as Microsoft Excel does), you can monkeypatch the descriptor to allow a larger max font family.

# IMPORTANT, you must do this before importing openpyxl
from unittest import mock
# Set max font family value to 100
p = mock.patch('openpyxl.styles.fonts.Font.family.max', new=100)
p.start()
import openpyxl
openpyxl.open('my-bugged-worksheet.xlsx') # this works now!

This can be reproduced using this excel workbook . Before the patch this will fail to load. After the patch, it loads without error.

Here's what fixed this error for me. I edited lib\site-packages\openpyxl\descriptors\base.py and added a print statement after line 86 in class Max like so:

def __set__(self, instance, value):
    if ((self.allow_none and value is not None)
        or not self.allow_none):
        value = _convert(self.expected_type, value)
        if value > self.max:
            print(f"value is {value}")
            raise ValueError('Max value is {0}'.format(self.max))
    super(Max, self).__set__(instance, value)

This printed the value of 34 which is obviously higher than the max value of 14.
All I did to make it work was comment out the line that raise s the error.
Changing the code to:

def __set__(self, instance, value):
    if ((self.allow_none and value is not None)
        or not self.allow_none):
        value = _convert(self.expected_type, value)
        if value > self.max:
            self.max = value
            # print(f"value is {value}")
            # raise ValueError('Max value is {0}'.format(self.max))
    super(Max, self).__set__(instance, value)

This solved the problem for me.
Or if you need to distribute the file and have to use the orignal library code THEN try the first answer .

# IMPORTANT, you must do this before importing openpyxl
from unittest import mock
# Set max font family value to 100
p = mock.patch('openpyxl.styles.fonts.Font.family.max', new=100)
p.start()
import openpyxl
openpyxl.open('my-bugged-worksheet.xlsx') # this works now!

Before importing openpyxl.

If I right, you want to get all of xlsx sheet name from files in a directory so you can do this:

import pandas as pd
import os
dirpth = './Target Folder/'
for dirpath, dirnames, filenames in os.walk(dirpth):
    file_names = filenames
file_names = [dirpth+file_names[i] for i in range(len(file_names))]
data = []
sheet_names = []
for names in file_names:
    df = pd.ExcelFile(names,engine = 'openpyxl')
    data_sheet = []
    sheet_temp = []
    for name in df.sheet_names:
        data_sheet.append(df.parse(nama,index_col = [0]))
        sheet_temp.append(name)
    data.append(data_sheet)
    sheet_names.append(sheet_temp)

In this way, you will get data from each sheet automatically for each excel file, but it will give error where you have file with difference extension in the same folder (for example in the same folder you have.csv file). So you need to filter all of file name first or you can use try except statement to skip non excel file. If your.py file have difference path with your folder target, just change dirpath, example: 'D:/changeYour Folder Path/Example/Target/'

Note: You need to install openpyxl

This problem could be solved by cleaning xlsx styles completely , here is my code how to do it with pandas though openpyxl https://stackoverflow.com/a/71526058/1731460

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM