
reference to invalid character number: (Python ElementTree parse)

I have an XML file with the following content:

    <word>vegetation</word>
    <word>cover</word>
    <word>(&#x2;31%</word>
    <word>split_identifier ;</word>
    <word>Still</word>
    <word>and</word>

When I read the file using ElementTree's parse, it gives me this error:

xml.etree.ElementTree.ParseError: reference to invalid character number

It's because of &#x2; (which is supposed to be "~").

How can I handle such issues? I am not sure how many other symbols like this I will get in the future.

If you want to get rid of those special characters, you can do so by scrubbing the input XML as a string before parsing it:

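    import re
    import xml.etree.ElementTree as ET

    respXML = response.content.decode("utf-16")
    # Remove numeric character references such as &#x2; or &#2;
    scrubbedXML = re.sub(r'&#(?:[xX][0-9A-Fa-f]+|[0-9]+);', '', respXML)
    respRoot = ET.fromstring(scrubbedXML)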
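
Scrubbing with a broad pattern also removes references that are perfectly valid. `&#x2;` refers to U+0002, a control character outside the range XML 1.0 allows, which is why the parser rejects it. A narrower sketch, assuming you only want to drop the references the parser would reject and keep the rest (the helper names `strip_invalid_char_refs` and `_is_xml10_char` and the sample string are made up for illustration), could look like this:

```python
import re
import xml.etree.ElementTree as ET

# Matches decimal (&#2;) and hexadecimal (&#x2;) character references.
_CHAR_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def _is_xml10_char(code_point):
    """Return True if the code point is allowed by the XML 1.0 Char production."""
    return (
        code_point in (0x9, 0xA, 0xD)
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )


def strip_invalid_char_refs(xml_text, replacement=""):
    """Replace only the character references that XML 1.0 forbids."""
    def _fix(match):
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # Keep valid references untouched; swap out the invalid ones.
        return match.group(0) if _is_xml10_char(code_point) else replacement
    return _CHAR_REF.sub(_fix, xml_text)


# Hypothetical sample: one invalid reference (&#x2;) and one valid one (&#x41;).
sample = "<words><word>(&#x2;31%</word><word>&#x41;&amp;B</word></words>"
root = ET.fromstring(strip_invalid_char_refs(sample))
words = [w.text for w in root.iter("word")]
print(words)
```

If the invalid reference is known to stand for a particular character, passing that character as `replacement` keeps it in the text instead of dropping it.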

If you prefer to keep the special characters, you can decode them before parsing. In your case the references look like HTML character references, so you can use Python's html module:

    import html
    import xml.etree.ElementTree as ET

    respRoot = ET.fromstring(html.unescape(response.content.decode("utf-16")))
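
One caveat with unescaping the whole document: `html.unescape` also decodes references such as `&lt;` and `&amp;`, so text that was correctly escaped can turn into markup that no longer parses. The following is a small sketch of that effect on a made-up sample string, not on the data from the question:

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical sample with correctly escaped '<' and '&' in the text.
sample = "<words><word>a &lt; b &amp; c</word></words>"

# Parsing the document as-is works and yields the decoded text.
direct_text = ET.fromstring(sample).find("word").text

# Unescaping first puts a literal '<' and '&' into the markup,
# which is no longer well-formed XML.
try:
    ET.fromstring(html.unescape(sample))
    unescaped_parses = True
except ET.ParseError:
    unescaped_parses = False

print(direct_text, unescaped_parses)
```

Because of this, unescaping the whole document is probably only safe when the text is known not to contain escaped markup characters.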
