I'm using PDFMiner, but it contains a bug and produces the following invalid XML file:
<?xml version="1.1" encoding="UTF-8"?>
<string size="16">ô‚ÌfƇ*š]Ö[</string>
When I try to parse it with ElementTree, I get the following error:
bookXml = xml.etree.ElementTree.parse(filename)
File "C:\Users\User\Anaconda3\lib\xml\etree\ElementTree.py", line 1196, in parse
tree.parse(source, parser)
File "C:\Users\User\Anaconda3\lib\xml\etree\ElementTree.py", line 597, in parse
self._root = parser._parse_whole(source)
xml.etree.ElementTree.ParseError: reference to invalid character number: line 1, column 36
I think the best way to handle this is to fix the XML first, but how?
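I would wrap the offending string in a CDATA section. Inside CDATA the parser treats everything as literal text, so it no longer tries to resolve the invalid character reference. The file parsed successfully as soon as I did this. Example: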
<?xml version="1.1" encoding="UTF-8"?>
<string><![CDATA[ô‚ÌƇ*šÖ]]></string>
More about CDATA here.
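If there are too many strings to fix by hand, the wrapping can be done in code before the file reaches ElementTree. The following is only a minimal sketch. It assumes the bad data always sits inside `<string>` elements, as in the question, and that the problem is an invalid character reference, as the error message suggests. `wrap_strings_in_cdata` and the sample document are hypothetical names and data made up for illustration; they are not part of PDFMiner or ElementTree.

```python
import re
import xml.etree.ElementTree as ET

# Captures the opening tag, body, and closing tag of each <string> element.
# The lookbehind skips self-closing elements such as <string size="0"/>.
STRING_ELEMENT = re.compile(
    r"(<string\b[^>]*(?<!/)>)(.*?)(</string>)", re.DOTALL
)


def wrap_strings_in_cdata(xml_text):
    """Hypothetical helper: wrap the body of every <string> element in CDATA."""

    def repl(match):
        open_tag, body, close_tag = match.groups()
        if body.startswith("<![CDATA["):
            return match.group(0)  # already wrapped, leave it untouched
        # A literal "]]>" would end the section early, so split it in two.
        body = body.replace("]]>", "]]]]><![CDATA[>")
        return f"{open_tag}<![CDATA[{body}]]>{close_tag}"

    return STRING_ELEMENT.sub(repl, xml_text)


# Made-up sample: &#0; is not a legal character reference in XML.
broken = '<?xml version="1.0" encoding="UTF-8"?>\n<string size="3">a&#0;b</string>'

fixed = wrap_strings_in_cdata(broken)
root = ET.fromstring(fixed.encode("utf-8"))
print(repr(root.text))
```

Note that the element text comes back as the literal characters `a&#0;b`, not as a decoded value, so any further decoding of the string is up to you. For a real file, read it as text, pass it through the helper, and parse the result with `ET.fromstring` instead of `ET.parse(filename)`.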