简体   繁体   中英

Python XML Parse errors with Invalid Token

I am moving code from Python 2.7 to Python 3.10. One section of the code creates XML which is 'prettyfied' and written to a file. But parsing in Python 3.x is throwing an error. In one case the problem seems to be with an encoded en-dash character.

<?xml version='1.0' encoding='utf8'?>
<properties>
    <entry key="name">AB&amp;R - RFA #3 \xe2\x80\x93 Alignment</entry>
</properties>

The parsing is done as follows:

xml_parsed = xml.dom.minidom.parseString(xml_string)
return xml_parsed.toprettyxml("    ", "\n")

The error thrown is:

not well-formed (invalid token): line 2

I don't think this problem happened with Python 2.7. There is a nice description about en-dash here (although I think my problem is not limited to en-dash).

What can be done to fix this?

My original description of the problem was not right. The XML text was stored as a byte string. The following code worked for me:

    xml_string = xml_string.decode("utf-8")
    xml_parsed = xml.dom.minidom.parseString(xml_string)
    return xml_parsed.toprettyxml("    ", "\n")

I didn't need to do the utf-8 decode in Python 2.7.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM