Parsing XML with "not well-formed" characters in Python

Question

I am getting XML data from an application, and I want to parse it in Python:

#!/usr/bin/python

import xml.etree.ElementTree as ET
import re

xml_file = 'tickets_prod.xml'
xml_file_handle = open(xml_file,'r')
xml_as_string = xml_file_handle.read()
xml_file_handle.close()

xml_cleaned = re.sub(u'[^\x01-\x7f]+',u'',xml_as_string)
root = ET.fromstring(xml_cleaned)

It works on a smaller data set with sample data, but when I use the real live data, I get

xml.parsers.expat.ExpatError: not well-formed (invalid token): line 364658, column 72

Looking at the XML file, I see this at line 364658:

WARNING - (1 warnings in check_logfiles.protocol-2013-05-28-12-53-46) - ^[[0:36mnotice: Scope(Class[Hwsw]): Not required on ^[[0m</description>

I guess it is the ^[ that makes Python choke; it is also highlighted in blue in vim. I had hoped that my regex substitution would clean up the data, but it does not.

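The best thing would be to fix the application that generates the XML, but that is out of scope, so I need to deal with the data as it is. How can I work around this? I can live with simply throwing away the "illegal" characters.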

Answer 1

You are already doing this:

xml_cleaned = re.sub(u'[^\x01-\x7f]+',u'',xml_as_string)

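But the character ^[ is probably what Python writes as \x1b (the ESC control character). It lies inside the range \x01-\x7f, so your pattern never removes it. If xml.parsers.expat chokes on it, you simply need to clean up more, by accepting only a few of the characters below 0x20 (space). For example: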

xml_cleaned = re.sub(u'[^\n\r\t\x20-\x7f]+',u'',xml_as_string)
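The pattern above also discards every non-ASCII character. If that is not wanted, a variation on the same idea is to remove only what the XML 1.0 specification itself forbids. The following is a minimal sketch, assuming Python 3 and input that has already been decoded to str; the helper name strip_invalid_xml_chars and the sample string are made up for illustration:

```python
import re
import xml.etree.ElementTree as ET

# Matches any character NOT allowed by the XML 1.0 "Char" production:
# tab, newline, carriage return, U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF.
_INVALID_XML_CHARS = re.compile(
    r'[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]'
)


def strip_invalid_xml_chars(text):
    """Return text with the characters that XML 1.0 forbids removed."""
    return _INVALID_XML_CHARS.sub('', text)


# Hypothetical sample resembling the offending line from the question.
sample = '<description>\x1b[0:36mnotice: caf\u00e9\x1b[0m</description>'
root = ET.fromstring(strip_invalid_xml_chars(sample))
print(root.text)
```

Only the ESC characters are removed here; the rest of each terminal colour sequence (such as [0:36m) stays in the text, just as it does with the ASCII-only pattern.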

Answer 2

I know this is quite old, but I stumbled upon the following URL, which has a list of all the main characters and their encodings:

https://medium.com/interview-buddy/handling-ascii-character-in-python-58993859c38e
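Besides looking characters up in a table, it can help to check which control characters the data actually contains. A minimal sketch, assuming Python 3 and text already read into a str (the helper name find_control_chars and the sample string are hypothetical), reports every character below 0x20 other than tab, newline and carriage return, together with its 1-based line number and 0-based column:

```python
import re

# C0 control characters other than tab (\x09), newline (\x0a)
# and carriage return (\x0d).
_CONTROL_CHARS = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f]')


def find_control_chars(text):
    """Yield (line_number, column, repr_of_char) for each control character."""
    # Split on '\n' only: str.splitlines() would also split on some of the
    # control characters we are looking for (for example \x0b and \x0c).
    for line_number, line in enumerate(text.split('\n'), start=1):
        for match in _CONTROL_CHARS.finditer(line):
            yield line_number, match.start(), repr(match.group())


# Hypothetical sample with two ESC characters on its second line.
sample = 'first line\nnotice: \x1b[0:36mScope\x1b[0m\n'
for hit in find_control_chars(sample):
    print(hit)
```

Running something like this over the real file would show whether ESC is the only offender or whether other control characters also need to be removed.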

2 answers

Answer 1: score 3, accepted, 2013-10-29 09:44:34
Answer 2: score 0, 2019-04-01 20:41:09
