Python lxml: how to handle encoding errors when parsing XML strings?

Question

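I need help parsing XML data. Here is the scenario:

I load XML files as strings into a PostgreSQL database.
I download them to a text file for further analysis. Each line corresponds to one XML file.
The strings have different encodings. Some explicitly specify utf-8, others windows-1252. There may be others as well; some do not specify an encoding in the string at all.
I need to parse these strings to get at the data. The best approach I have found is the following: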

from lxml import etree
# xml_data is one line of the text file, i.e. one XML document as a str
encoded_string = xml_data.encode('utf-8')
root = etree.fromstring(encoded_string)

When it does not work, I get two types of error message, for example:

"Extra content at the end of the document, line 1, column x (<string>, line 1)" 
# x varies with string; I think it corresponds to the last character in the line

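Looking at the lines that raise exceptions, it appears that the "Extra content" error is raised by the files with windows-1252 encoding.

I need to be able to parse every string, ideally without having to alter them in any way after downloading. I have tried the following:

Applying 'windows-1252' as the encoding instead.
Reading the string as binary and then applying the encoding.
Reading the string as binary and converting it directly with etree.fromstring.

The last attempt produced this error: ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

What can I do? I need to be able to read these strings but cannot work out how to parse them. The XML strings that use the Windows encoding all start with <?xml version="1.0" encoding="windows-1252"?>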

Answer 1

Given that the table column is text, all of the XML content is presented to Python as UTF-8. As a result, attempting to parse it with a conflicting encoding attribute in the XML declaration will cause problems.

Perhaps try stripping that attribute from the string.
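
One way to follow this suggestion in Python is to drop only the encoding pseudo-attribute from the XML declaration before handing the already-decoded string to the parser. The sketch below is a minimal illustration, not the answerer's code: the helper names strip_encoding_attribute and parse_decoded_xml are hypothetical, and the fallback to the standard library's ElementTree is an assumption added so the snippet still runs where lxml is not installed.

```python
import re

try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Assumption: fall back to the stdlib parser when lxml is unavailable.
    import xml.etree.ElementTree as etree

# Matches encoding="..." (or '...') inside an XML declaration at the very start.
_ENCODING_ATTR = re.compile(
    r"""^(<\?xml[^>]*?)\s+encoding\s*=\s*(["'])[^"']*\2""",
    re.IGNORECASE,
)


def strip_encoding_attribute(xml_text: str) -> str:
    """Return xml_text with the encoding attribute removed from its declaration."""
    return _ENCODING_ATTR.sub(r"\1", xml_text, count=1)


def parse_decoded_xml(xml_text: str):
    """Parse a str that has already been decoded, ignoring its declared encoding."""
    return etree.fromstring(strip_encoding_attribute(xml_text))
```

Strings without a declaration, or with a declaration that has no encoding attribute, pass through unchanged, so the helper can be applied to every line regardless of how it starts.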

Answer 2

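I solved the problem by removing the encoding information and the literal newline and carriage-return sequences. If I open the file that returns errors in vim and run the following three commands, every string parses successfully: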

:%s/\\r//g
:%s/\\n//g
:%s/<?.*?>//g

After that, lxml parses the strings without any problem.
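
The same three substitutions can be applied in Python instead of vim. This is a rough sketch under a few assumptions: the function name clean_exported_line is hypothetical, the file holds one document per line, and only a leading XML declaration is removed, which is narrower than the greedy vim pattern above.

```python
import re

# A backslash followed by "n" or "r": the two-character escape text found in
# the exported file, not a real newline or carriage return.
_LITERAL_ESCAPES = re.compile(r"\\[nr]")
# A leading XML declaration such as <?xml version="1.0" encoding="..."?>.
_XML_DECLARATION = re.compile(r"^\s*<\?xml.*?\?>")


def clean_exported_line(line: str) -> str:
    """Remove literal \\n and \\r sequences and a leading XML declaration."""
    line = _LITERAL_ESCAPES.sub("", line)
    return _XML_DECLARATION.sub("", line, count=1)
```

Note that this removes every backslash-n and backslash-r pair, including any that legitimately appear inside element text (a Windows path, for instance), so it is only safe when the data is known not to contain them.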

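Update:

I have a better solution. The problem was the literal \n and \r sequences in the UTF-8 encoded strings I was copying to the text file. I only needed to remove those characters from the strings with regexp_replace, like this: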

select regexp_replace(xml_column, '\\n|\\r', '', 'g') from resource;

Now I can run the following command and read the data with lxml without further processing:

psql -d database -c "copy (select regexp_replace(xml_column, '\\n|\\r', '', 'g') from resource ) to stdout" > output.txt
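
For completeness, reading the exported file back might look like the sketch below. It assumes the text was decoded as UTF-8 when read, so the parser is told to use UTF-8 regardless of what each declaration claims. parse_lines is a hypothetical helper, and the ElementTree fallback is only there so the snippet runs without lxml.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the stdlib parser when lxml is unavailable.
    import xml.etree.ElementTree as etree


def parse_lines(lines):
    """Parse one XML document per line; return (roots, failures)."""
    roots, failures = [], []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # A fresh parser per document; encoding="utf-8" overrides any
        # encoding named in the document's own XML declaration.
        parser = etree.XMLParser(encoding="utf-8")
        try:
            roots.append(etree.fromstring(line.encode("utf-8"), parser=parser))
        except (ValueError, SyntaxError) as exc:
            # lxml's XMLSyntaxError and the stdlib ParseError both derive
            # from SyntaxError.
            failures.append((number, str(exc)))
    return roots, failures
```

With the file produced above, this could be called as `with open("output.txt", encoding="utf-8") as fh: roots, failures = parse_lines(fh)`, after which `failures` lists the line numbers that still need attention.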
