How to use Python to read Excel files that contain extended fonts? (openpyxl error: Max value is 14)

As a learning project for Python, I am attempting to read all Excel files in a directory and extract the names of all the sheets.

I have been trying several available Python modules to do this (pandas in this example), but I keep running into the same issue because most of them depend on openpyxl.

This is my current code:

import os
import pandas

directory_root = 'D:\\testFiles'

# Dict to hold all files, stats
all_files = {}

for _current_path, _dirs_in_path, _files_in_path in os.walk(directory_root):

    # Add all files to this `all_files`
    for _file in _files_in_path:
        # Extract filesystem stats from the file
        _stats = os.stat(os.path.join(_current_path, _file))

        # Add the full file path and its stats to the `all_files` dict.
        all_files[os.path.join(_current_path, _file)] = _stats

# Loop through all found files to extract the sheet names
for _file in all_files:

    # Open the workbook
    xls = pandas.ExcelFile(_file)

    # Loop through all sheets in the workbook
    for _sheet in xls.sheet_names:
        print(_sheet)

This raises an error from openpyxl when calling pandas.ExcelFile(): ValueError: Max value is 14.

From what I can find online, this is because the file contains a font family value above 14. How do I read from an Excel (xlsx) file while disregarding any existing formatting?

The only potential solution I could find suggests modifying the original file and removing the formatting, but this is not an option as I do not want to modify the files in any way.

Is there another way to do this that doesn't have this formatting limitation?

It is easy to detect when a family value is out of range with a simple unzip | find on Windows, or a grep elsewhere, so you could filter out files based on those values. In the bad-boy example here there is an acceptable 2 and an unacceptable 34.
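The same check can be sketched in Python with only the standard library. This is a minimal sketch, not a full validator: the function name `out_of_range_families` is hypothetical, and it assumes the styles part lives at the conventional path `xl/styles.xml` and that the `family` element is written without a namespace prefix.

```python
import re
import zipfile

# Matches <family val="N"/>; tolerant of spacing and of either quote style.
FAMILY_RE = re.compile(rb'<family\s+val\s*=\s*["\'](\d+)["\']')


def out_of_range_families(xlsx_path, max_family=14):
    """Return the font family values in xl/styles.xml that exceed max_family.

    Hypothetical helper: assumes the styles part is at xl/styles.xml.
    xlsx_path may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(xlsx_path) as zf:
        try:
            styles = zf.read("xl/styles.xml")
        except KeyError:
            return []  # no styles part, so nothing to flag
    values = (int(v) for v in FAMILY_RE.findall(styles))
    return [v for v in values if v > max_family]
```

A file for which this returns a non-empty list is one that openpyxl would be expected to reject, so it could be skipped or handled separately. The original file is only read, never modified.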

However, since all platforms (Windows 10 included) ship with tar, it is easiest to first expand file.xlsx into its component files and then search those files with the native OS tools (or Python), so that you know exactly which file needs adjusting.

So we now know it is styles.xml (not surprising, since that is where font values should be), and at this point we can use a string replace to change that entry to, say,

      <family val="3"/>

if that's more useful for your purpose.

Then repack the adjusted xlsx (NOTE: it is best to use a tool that only "updates" the one styles.xml file, so that the relative order of the entries in the zip is maintained). It should then behave just the same as a standard .xlsx that uses the standard font families 1-14, presuming the author did not introduce other errors.
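Since the question rules out modifying the files on disk, the replace-and-repack step can also be done on an in-memory copy. The following is a minimal sketch under stated assumptions: `patched_copy` is a hypothetical name, the styles part is assumed to be at `xl/styles.xml`, and the replacement value 3 simply mirrors the example above rather than being a recommended mapping.

```python
import io
import re
import zipfile

# Captures the text around the number so only the number is rewritten.
FAMILY_RE = re.compile(rb'(<family\s+val\s*=\s*["\'])(\d+)(["\'])')


def patched_copy(xlsx_path, max_family=14, replacement=3):
    """Return a BytesIO copy of the workbook with out-of-range families replaced.

    Hypothetical helper: the source file is only read, never written.
    """
    def fix(match):
        if int(match.group(2)) > max_family:
            return match.group(1) + str(replacement).encode() + match.group(3)
        return match.group(0)  # in-range values are left untouched

    buffer = io.BytesIO()
    with zipfile.ZipFile(xlsx_path) as src, \
            zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():  # infolist() keeps the archive order
            data = src.read(info.filename)
            if info.filename == "xl/styles.xml":
                data = FAMILY_RE.sub(fix, data)
            dst.writestr(info, data)
    buffer.seek(0)
    return buffer
```

Both `openpyxl.load_workbook` and `pandas.ExcelFile` accept a binary file-like object, so the returned buffer can be passed to either in place of a path.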

This is most probably not because of a font size or family, because it gives a ValueError. From what I see on this page and this page, it seems that one of the float values in your Excel file must not be more than 14. That's why it gives the error ValueError: Max value is 14. You may dig into the file, search for a value greater than 14, and try your code again after changing that value.

The issue is that your file does not conform to the Office Open XML (OOXML) specification. Only certain font families are allowed. Once openpyxl encounters a font that is out of specification, it throws this error, because openpyxl only allows spec-conforming Excel files.

Some Excel readers may not have an issue with this and are more flexible with files that do not conform to the specification, but openpyxl only implements the Office Open XML spec.

The XML being parsed will contain information about the font like this:

<font>
  <b/>
  <sz val="11"/>
  <color rgb="FF000000"/>
  <name val="Century Gothic"/>
  <family val="34"/>
</font>

If the family value is over 14, openpyxl throws this ValueError. An underlying descriptor in openpyxl enforces this limit.

When other readers such as Microsoft Office 365 Excel encounter this, they change the font family to a compliant font (the default, Calibri) when loading the file.

As a workaround, if you don't want to change the value (as Microsoft Excel does), you can monkeypatch the descriptor to allow a larger maximum font family.

# IMPORTANT, you must do this before importing openpyxl
from unittest import mock
# Set max font family value to 100
p = mock.patch('openpyxl.styles.fonts.Font.family.max', new=100)
p.start()
import openpyxl
openpyxl.open('my-bugged-worksheet.xlsx') # this works now!

This can be reproduced using this Excel workbook. Before the patch it fails to load. After the patch, it loads without error.
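To see why raising `max` on the class attribute is enough, here is a toy model of a max-checked descriptor. It is a simplified stand-in written for illustration, not openpyxl's actual class: the `Max` and `Font` classes and the `try_family` helper below are all hypothetical.

```python
class Max:
    """Toy stand-in for a max-checked descriptor; not openpyxl's real class."""

    def __init__(self, max):
        self.max = max

    def __set_name__(self, owner, name):
        self.name = name

    def __set__(self, instance, value):
        if value > self.max:
            raise ValueError('Max value is {0}'.format(self.max))
        instance.__dict__[self.name] = value


class Font:
    # One descriptor object lives on the class and is shared by every Font.
    family = Max(max=14)


def try_family(value):
    """Return 'ok' if the assignment is accepted, else the error message."""
    try:
        Font().family = value
        return "ok"
    except ValueError as exc:
        return str(exc)


before = try_family(34)   # rejected while the limit is 14
Font.family.max = 100     # same kind of change the mock.patch call makes
after = try_family(34)    # accepted once the shared limit is raised
```

Because the limit is stored on the single descriptor object attached to the class, changing it once affects every instance created afterwards, which is why the patch has to be in place before any workbook is loaded.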

Here's what fixed this error for me. I edited lib\site-packages\openpyxl\descriptors\base.py and added a print statement after line 86 in class Max, like so:

def __set__(self, instance, value):
    if ((self.allow_none and value is not None)
        or not self.allow_none):
        value = _convert(self.expected_type, value)
        if value > self.max:
            print(f"value is {value}")
            raise ValueError('Max value is {0}'.format(self.max))
    super(Max, self).__set__(instance, value)

This printed the value 34, which is obviously higher than the max value of 14.
All I did to make it work was comment out the line that raises the error, changing the code to:

def __set__(self, instance, value):
    if ((self.allow_none and value is not None)
        or not self.allow_none):
        value = _convert(self.expected_type, value)
        if value > self.max:
            self.max = value
            # print(f"value is {value}")
            # raise ValueError('Max value is {0}'.format(self.max))
    super(Max, self).__set__(instance, value)

This solved the problem for me.
Or, if you need to distribute the file and have to use the original library code, then try the first answer:

# IMPORTANT, you must do this before importing openpyxl
from unittest import mock
# Set max font family value to 100
p = mock.patch('openpyxl.styles.fonts.Font.family.max', new=100)
p.start()
import openpyxl
openpyxl.open('my-bugged-worksheet.xlsx') # this works now!

Run this before importing openpyxl.

If I understand correctly, you want to get all of the sheet names from the xlsx files in a directory, so you can do this:

import os
import pandas as pd

dirpth = './Target Folder/'

# Collect the full path of every file found under dirpth
file_names = []
for dirpath, dirnames, filenames in os.walk(dirpth):
    file_names.extend(os.path.join(dirpath, f) for f in filenames)

data = []         # data[i][j] is the DataFrame for sheet j of file i
sheet_names = []  # sheet_names[i] lists the sheet names of file i
for names in file_names:
    df = pd.ExcelFile(names, engine='openpyxl')
    data_sheet = []
    sheet_temp = []
    for name in df.sheet_names:
        # Parse each sheet, using its first column as the index
        data_sheet.append(df.parse(name, index_col=[0]))
        sheet_temp.append(name)
    data.append(data_sheet)
    sheet_names.append(sheet_temp)

In this way, you will automatically get the data from each sheet of each Excel file, but it will raise an error if the same folder contains files with a different extension (for example, a .csv file). So you need to filter the file names first, or you can use a try/except statement to skip non-Excel files. If your .py file is in a different location from the target folder, just change dirpth, for example: 'D:/changeYour Folder Path/Example/Target/'
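The filtering and try/except ideas above can be sketched as follows. This version goes one step further and reads only the sheet names straight from the workbook part with the standard library, so styles are never parsed. It is a minimal sketch under stated assumptions: the helper names are hypothetical, and the workbook part is assumed to be at the conventional path `xl/workbook.xml`.

```python
import os
import zipfile
import xml.etree.ElementTree as ET


def iter_xlsx_files(root):
    """Yield paths under root whose extension is .xlsx (case-insensitive)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if filename.lower().endswith(".xlsx"):
                yield os.path.join(dirpath, filename)


def sheet_names(xlsx_path):
    """Read sheet names from xl/workbook.xml without touching any styles."""
    with zipfile.ZipFile(xlsx_path) as zf:
        root = ET.fromstring(zf.read("xl/workbook.xml"))
    # Compare local tag names so the namespace URI does not matter.
    return [el.get("name") for el in root.iter()
            if el.tag.rsplit("}", 1)[-1] == "sheet"]


def all_sheet_names(root):
    """Map each readable .xlsx path to its sheet names, skipping bad files."""
    result = {}
    for path in iter_xlsx_files(root):
        try:
            result[path] = sheet_names(path)
        except (zipfile.BadZipFile, KeyError, ET.ParseError):
            continue  # not a usable xlsx package, so skip it
    return result
```

Calling `all_sheet_names('./Target Folder/')` would return a dict keyed by file path. Files that merely carry an .xlsx extension but are not valid zip packages are skipped rather than stopping the loop.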

Note: You need to install openpyxl.

This problem can be solved by cleaning the xlsx styles completely. Here is my code showing how to do it with pandas through openpyxl: https://stackoverflow.com/a/71526058/1731460
