Python lxml: how to handle encoding errors when parsing XML strings?

Question

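I need help parsing XML data. Here is the scenario:

I load XML files into a PostgreSQL database as strings.
I download them to a text file for further analysis. Each line corresponds to one XML file.
The strings have different encodings. Some explicitly specify utf-8, others windows-1252. There may be others as well; some strings do not specify an encoding at all.
I need to parse these strings to get at the data. The best approach I have found is the following: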

from lxml import etree
# xml_data is one row from the text file, held as a str
encoded_string = xml_data.encode('utf-8')
root = etree.fromstring(encoded_string)

When it fails, I get two kinds of error messages:

"Extra content at the end of the document, line 1, column x (<string>, line 1)" 
# x varies with string; I think it corresponds to the last character in the line

Looking at the rows that raise the exception, the "Extra content" error appears to come from files with windows-1252 encoding.

I need to be able to parse every string, ideally without altering them in any way after downloading. I have tried the following:

Applying 'windows-1252' as the encoding instead.
Reading the string as binary and then applying the encoding.
Reading the string as binary and converting it directly with etree.fromstring.

The last attempt produced this error: ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

What can I do? I need to be able to read these strings, but I don't know how to parse them. The XML strings that use the Windows encoding all start with <?xml version="1.0" encoding="windows-1252"?>

Answer 1

Given that the table column is text, all the XML content is presented to Python as UTF-8. As a result, attempting to parse XML whose encoding attribute conflicts with that will cause problems.

Perhaps try stripping that attribute from the string.
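
One way the stripping could look, as a minimal sketch rather than a tested recipe: it assumes the row is already a Python str and that the declaration sits at the very start of the string. The names parse_without_declaration and XML_DECLARATION are made up for illustration, and the fallback import is only there so the snippet also runs where lxml is not installed.

```python
import re

try:
    from lxml import etree  # the parser used in the question
except ImportError:
    # Assumption: stdlib fallback, only so the sketch runs without lxml
    import xml.etree.ElementTree as etree

# Matches a leading declaration such as
# <?xml version="1.0" encoding="windows-1252"?>
XML_DECLARATION = re.compile(r'^\s*<\?xml[^>]*\?>')


def parse_without_declaration(xml_text):
    """Drop the XML declaration from a str, then parse what is left."""
    stripped = XML_DECLARATION.sub('', xml_text, count=1)
    return etree.fromstring(stripped)


sample = '<?xml version="1.0" encoding="windows-1252"?><note>café</note>'
root = parse_without_declaration(sample)
print(root.tag, root.text)
```

Once the declaration is gone, the str can be passed to the parser directly, so no manual encoding step is needed.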

Answer 2

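I solved this by removing the encoding information, the newline literals, and the carriage-return literals. If I open a file that returns errors in vim and run the following three commands, every string parses successfully: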

:%s/\\r//g
:%s/\\n//g
:%s/<?.*?>//g

After that, lxml parses the strings without problems.
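
For anyone who would rather do this in Python than in vim, here is a rough equivalent of the three substitutions. It is only a sketch: clean_row and sample are hypothetical names, and it assumes the two-character sequences \r and \n never occur as real content inside the XML.

```python
import re


def clean_row(line):
    """Apply the same three clean-ups as the vim commands to one row."""
    line = line.replace('\\r', '')  # literal backslash + r, not a carriage return
    line = line.replace('\\n', '')  # literal backslash + n, not a newline
    # vim's .* is greedy; the non-greedy form used here is a deliberate,
    # more conservative variation that removes each <?...?> block separately.
    return re.sub(r'<\?.*?\?>', '', line)


# One row as it might appear in the dump (the backslashes are literal characters)
sample = ('<?xml version="1.0" encoding="windows-1252"?>\\r\\n'
          '<note>\\r\\n<to>Ana</to>\\r\\n</note>\\r\\n')
print(clean_row(sample))
```

The cleaned string no longer carries a declaration or trailing text after the root element, which are the two things the error messages complain about.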

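Update:

I have a better solution. The problem was the literal \n and \r sequences in the UTF-8 encoded strings I was copying to the text file. I just need to remove these characters from the strings with regexp_replace, like this: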

select regexp_replace(xmlcolumn, '\\n|\\r', '', 'g') from table;

Now I can run the following command and read the data with lxml without further processing:

psql -d database -c "copy (select regexp_replace(xml_column, '\\n|\\r', '', 'g') from resource ) to stdout" > output.txt
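
Reading the resulting file could then look roughly like the sketch below. It assumes one XML document per line; parse_rows is a hypothetical helper, and the fallback import is only there so the snippet runs without lxml. Feeding the parser bytes avoids the ValueError that str input with an encoding declaration triggers. One thing worth checking: if the dump is UTF-8 but a row still declares windows-1252, non-ASCII characters may be decoded incorrectly.

```python
try:
    from lxml import etree  # the parser used in the question
except ImportError:
    # Assumption: stdlib fallback, only so the sketch runs without lxml
    import xml.etree.ElementTree as etree


def parse_rows(lines):
    """Parse one XML document per line; return (roots, failures)."""
    roots, failures = [], []
    for number, line in enumerate(lines, start=1):
        data = line.strip()
        if not data:
            continue  # skip blank lines
        try:
            roots.append(etree.fromstring(data))
        except etree.ParseError as exc:  # lxml's XMLSyntaxError subclasses this
            failures.append((number, str(exc)))
    return roots, failures


# Two sample rows as bytes; the second still ends in a literal backslash + n,
# which the parser sees as extra content after the root element.
rows = [b'<a>1</a>\n', b'<b/>\\n\n']
roots, failures = parse_rows(rows)
print(len(roots), failures)

# With the real dump (file name taken from the psql command above):
# with open('output.txt', 'rb') as handle:
#     roots, failures = parse_rows(handle)
```

Collecting the failures with their line numbers makes it easier to see which rows still need cleaning.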
