Parsing nested elements with iterparse() or findall()

Question

I have XML documents (UTF-8 encoded) with the following structure:

<Group id="123">
    <Rule id="abc" level="low"/>
    <identity>some text</identity>
    <element1>text</element1>
</Group>

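Each document has multiple Group elements, and the goal is to parse them into a spreadsheet in which each group is a row containing columns for the group ID, the level, and the text from the identity and element1 elements.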

I have a script using findall() that works when I parse one document at a time, but when I try to parse multiple documents at once it tends to fail with this error:

 File "c:/Documents/Python Projects/Bulkparse.py", line 86, in parseall
    writer.writerow(data)
  File "C:\Program Files (x86)\Python\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\x9d' in position 1137: character maps to <undefined>

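I looked up the '\x9d' character code and it seems to be some kind of cross icon, which doesn't appear in any of my documents. So I'm not sure where or why it is occurring.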

Example findall() script:

for child in root.findall('Group'):
  data.append(child.attrib['id'])
  num = child.attrib['id']
  for child in root.findall('Group[@id = "%s"]/Rule'% num ):
    data.append(child.attrib['level'])
    # followed by a for loop for each element needed ending with
    writer.writerow(data)

The above works, except when I'm processing a large batch, which gives me the error shown earlier.

Is findall() simply too inefficient? I tried to write something with iterparse(), but couldn't find a way to make it iterate over each child element. For example:

for  event, elem in context:
    if elem.tag ==f"Group" and event == 'end':
        data.append(elem.attrib['id'])
        num = elem.attrib['id']
        for event, elem in context :
            if elem.tag ==f"Rule" and event == 'end':
                data.append(elem.attrib['level'])
                print(data)

This returns the group ID followed by the level for every group, so [123, low, high, low, low, low, high, ...] and so on.

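Is iterparse() the better choice? If so, is there a way to have it target element tags nested inside the Group element, as I did with findall()? Or is there a way to stop the findall() script from throwing that error? Is there a way to clear memory at the end of each document (assuming that would help)? Any help is much appreciated.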

Answer 1

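Find the document that causes the problem by splitting the document set in half and searching whichever half keeps raising the error. Although you say '\x9d' isn't in your document set, it must be there in a different encoding.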
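As an alternative to bisecting by hand, the search could be scripted. The sketch below is only an illustration: `find_unencodable` is a made-up helper name, and it assumes the files really are UTF-8 and that cp1252 (the codec named in the traceback) is the one that fails. Note that U+009D is an invisible C1 control character, so it is easy to miss when looking at a file by eye.

```python
from pathlib import Path


def find_unencodable(paths, target_encoding="cp1252"):
    """Map each file path to the (offset, char) pairs that target_encoding
    cannot encode. Hypothetical helper; assumes the inputs are UTF-8."""
    hits = {}
    for path in paths:
        text = Path(path).read_text(encoding="utf-8")
        bad = []
        for offset, ch in enumerate(text):
            try:
                ch.encode(target_encoding)
            except UnicodeEncodeError:
                bad.append((offset, ch))
        if bad:
            hits[str(path)] = bad
    return hits
```

Calling it with the list of XML paths returns only the files that contain a problem character, along with the character offsets to inspect.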

You haven't said what the character encoding of the XML documents is; perhaps change the XML's character encoding to UTF-8?
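It is also worth noting that the traceback in the question passes through writer.writerow and cp1252.py, which suggests the failure happens while writing the spreadsheet rather than while reading the XML. A minimal sketch of one possible fix, assuming the output is produced with the standard csv module (`write_rows` and its parameters are hypothetical names, not taken from the original script):

```python
import csv


def write_rows(csv_path, rows):
    """Write rows to csv_path as UTF-8 instead of the platform default codec."""
    # newline="" is what the csv module documentation recommends for output files.
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerows(rows)
```

With an explicit encoding, Python no longer falls back to cp1252 on Windows, so a character such as '\x9d' can be written without raising UnicodeEncodeError.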

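If you can't find an encoding problem, you could switch the export process to an XSL transformation that does the XML-to-CSV conversion. That might be a better approach anyway.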

Answer 2

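This problem often shows up when reading files that contain unusual characters. One way to try to address it is to do the following when opening the XML file: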

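import xml.etree.ElementTree as etree  # assumed here; adjust if you use lxml

with open('myfile.xml', encoding='utf-8') as myfile:
    root = etree.parse(myfile).getroot()  # parse the open file object
    for child in root.findall('Group'):
        ...  # process each Group as before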

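This will take care of most of these problems. But I've run into many errors like this, and sometimes I've had to resort to actually editing the more troublesome characters in the file before processing it. Something like: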

[string representation of your file].replace('\x9d', '+')
# or whatever other character you want to use to represent a cross.
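On the iterparse() part of the question, here is a minimal sketch of one way to read the nested elements per Group and free memory as you go. It assumes the standard library xml.etree.ElementTree and the tag names from the question (Group, Rule, identity, element1); `iter_group_rows` is a hypothetical name, and the `.//` paths are used because the exact nesting in the real documents is not known.

```python
import xml.etree.ElementTree as ET


def iter_group_rows(source):
    """Yield [group id, rule level, identity text, element1 text] per Group."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "Group":
            continue
        # At the Group "end" event all of its descendants are already parsed,
        # so they can be looked up relative to this element instead of
        # pulling more events from the same iterator.
        rule = elem.find(".//Rule")
        yield [
            elem.get("id"),
            rule.get("level") if rule is not None else "",
            elem.findtext(".//identity", default=""),
            elem.findtext(".//element1", default=""),
        ]
        elem.clear()  # drop this Group's children once the row is emitted
```

Each yielded list is one spreadsheet row, so the result can be passed straight to a csv writer. The nested loop over the same context in the question consumes the remaining events for the whole file, which is probably why every level ended up in a single list.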


Answer 1: score 0, posted 2020-08-18 23:28:52

Answer 2: score -1, posted 2020-08-19 15:10:44
