Reading a spreadsheet-like .xml file with ElementTree

Question

I am reading an XML file with ElementTree, but there is one cell whose data I cannot read.

I trimmed my file down to a reproducible example, shown below:

from xml.etree import ElementTree
import io

xmlf = """<?xml version="1.0"?>
<?mso-application progid="Excel.Sheet"?>
<Workbook ss:ResourcesPackageName="" ss:ResourcesPackageVersion="" xmlns="urn:schemas-microsoft-com:office:spreadsheet"
 xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
 xmlns:html="http://www.w3.org/TR/REC-html40">
  <Worksheet ss:Name="DigitalOutput" ss:IsDeviceType="true">
     <Row ss:AutoFitHeight="0">
    <Cell><Data ss:Type="String">A</Data><NamedCell ss:Name="_FilterDatabase"/></Cell>
    <Cell><Data ss:Type="String">B</Data><NamedCell ss:Name="_FilterDatabase"/></Cell>
    <Cell><Data ss:Type="String">C</Data><NamedCell ss:Name="_FilterDatabase"/></Cell>
    <Cell ss:Index="7"><ss:Data ss:Type="String"
      xmlns="http://www.w3.org/TR/REC-html40"><Font html:Color="#000000">CAN'T READ </Font><Font>THIS</Font></ss:Data><NamedCell
      ss:Name="_FilterDatabase"/></Cell>
    <Cell ss:Index="10"><Data ss:Type="String">D</Data><NamedCell
      ss:Name="_FilterDatabase"/></Cell>
   </Row>
   </Worksheet>
 </Workbook>"""

ss = "urn:schemas-microsoft-com:office:spreadsheet"
worksheet_label = '{%s}Worksheet' % ss
row_label = '{%s}Row' % ss
cell_label = '{%s}Cell' % ss
data_label = '{%s}Data' % ss

tree = ElementTree.parse(io.StringIO(xmlf))
root = tree.getroot()

for ws in root.findall(worksheet_label):
    for table in ws.findall(row_label):
        for c in table.findall(cell_label):
            data = c.find(data_label)
            print(data.text)

The output is:

A
B
C
None
D

So the fourth cell is not read. Can you help me fix this?

Answer 1

Question: Reading a spreadsheet-like .xml file with ElementTree

Documentation: lxml.etree tutorial - Namespaces

Define the namespaces used:

ns = {'ss': "urn:schemas-microsoft-com:office:spreadsheet",
      'html': "http://www.w3.org/TR/REC-html40"}

Pass the namespaces to find(...) / findall(...):

tree = ElementTree.parse(io.StringIO(xmlf))
root = tree.getroot()

for ws in root.findall('ss:Worksheet', ns):
    for table in ws.findall('ss:Row', ns):
        for c in table.findall('ss:Cell', ns):
            data = c.find('ss:Data', ns)
            if data.text is None:
                # No direct text: collect the text of the html:Font children
                text = []
                data = data.findall('html:Font', ns)
                for element in data:
                    text.append(element.text)
                data_text = ''.join(text)
                print(data_text)
            else:
                print(data.text)

Output:

A
B
C
CAN'T READ THIS
D

Tested with Python: 3.5
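
The prefix map above is a convenience: 'ss:Worksheet' combined with ns resolves to the same '{uri}Worksheet' name that the question built by hand, so the namespace lookup itself was never the problem. A minimal sketch of that equivalence, using a cut-down document invented here for illustration:

```python
from xml.etree import ElementTree

# Hypothetical, stripped-down workbook (not the asker's full file).
XML = """<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet">
  <Worksheet><Row><Cell><Data>A</Data></Cell></Row></Worksheet>
</Workbook>"""

SS = "urn:schemas-microsoft-com:office:spreadsheet"
ns = {"ss": SS}

root = ElementTree.fromstring(XML)

# Lookup with a prefix map, as in this answer.
by_prefix = root.find("ss:Worksheet", ns)
# Lookup with the fully qualified {uri}Tag form, as in the question.
by_clark = root.find("{%s}Worksheet" % SS)

same_element = by_prefix is by_clark
print(by_prefix.tag)
print(same_element)
```

Either spelling works; the prefix map just keeps the search paths shorter when several namespaces are involved.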

Answer 2

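The text of the fourth cell belongs to two Font child elements, which are bound to a different namespace. The text attribute of the Data element only holds text that comes before its first child, so it is None here. Demonstration: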

for e in root.iter():
    text = e.text.strip() if e.text else None 
    if text:
        print(e, text)

Output:

<Element {urn:schemas-microsoft-com:office:spreadsheet}Data at 0x7f8013d01dc8> A
<Element {urn:schemas-microsoft-com:office:spreadsheet}Data at 0x7f8013d01dc8> B
<Element {urn:schemas-microsoft-com:office:spreadsheet}Data at 0x7f8013d01dc8> C
<Element {http://www.w3.org/TR/REC-html40}Font at 0x7f8013d01e08> CAN'T READ
<Element {http://www.w3.org/TR/REC-html40}Font at 0x7f8013d01e48> THIS
<Element {urn:schemas-microsoft-com:office:spreadsheet}Data at 0x7f8013d01e48> D
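
Since the text may sit in nested elements, one option is to join everything returned by itertext(), which yields the text of an element and all of its descendants in document order. The following is only a sketch under that assumption; the helper name cell_texts and the shortened XML are made up for illustration and are not part of the original question:

```python
from xml.etree import ElementTree

# Shortened version of the workbook from the question (two cells only).
XML = """<?xml version="1.0"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
 xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
 xmlns:html="http://www.w3.org/TR/REC-html40">
  <Worksheet ss:Name="DigitalOutput">
    <Row>
      <Cell><Data ss:Type="String">A</Data></Cell>
      <Cell ss:Index="7"><ss:Data ss:Type="String"
        xmlns="http://www.w3.org/TR/REC-html40"><Font html:Color="#000000">CAN'T READ </Font><Font>THIS</Font></ss:Data></Cell>
    </Row>
  </Worksheet>
</Workbook>"""

ns = {"ss": "urn:schemas-microsoft-com:office:spreadsheet"}


def cell_texts(root):
    """Return the full text of every ss:Data element, nested markup included."""
    texts = []
    for data in root.iterfind(".//ss:Data", ns):
        # itertext() walks the element and its descendants, so text inside
        # Font (or any other inline element) is picked up as well.
        texts.append("".join(data.itertext()))
    return texts


root = ElementTree.fromstring(XML)
cells = cell_texts(root)
for text in cells:
    print(text)
```

This avoids naming Font explicitly, so it should also cover cells that use other inline formatting elements, at the cost of discarding the formatting itself.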
