
XML Parse Error with invalid HTML code (Elementtree)

When I parse the XML string below, taken from a larger XML file, I run into what I think is an invalid HTML character code. The parser outputs the following error message.

The error message was: ParseError: reference to invalid character number

I deleted the rest of the description body and left only the part that caused the error. How do I get ElementTree to ignore these invalid HTML character codes or process them in some way?

The code and XML excerpt are below:

XML: <dc:description> **(10&#410)** </dc:description>


import xml.etree.ElementTree as ET

def process_file(file):
    # Parse the file as UTF-8; the ParseError is raised by ET.parse.
    parser = ET.XMLParser(encoding='utf-8')
    tree = ET.parse(file, parser=parser)
    return tree


How do I get elementtree to ignore these invalid HTML character codes or process them in some way?

You don't.

You're trying to apply an XML tool to non-XML data. It's properly refusing to cooperate.
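
To see where the parser refuses, you can catch the exception and read its position. This is only a minimal sketch: `find_parse_error` is a hypothetical helper name, and the sample string uses `&#4;` as an assumed example of a reference to a character that XML 1.0 does not allow, since the excerpt in the question may not be reproduced exactly.

```python
import xml.etree.ElementTree as ET

def find_parse_error(xml_text):
    """Return (line, column, message) for the first parse error, or None."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # ParseError.position is a (line, column) tuple set by the parser.
        line, column = err.position
        return line, column, str(err)
    return None

# Assumed sample: &#4; refers to a control character that XML 1.0 forbids.
bad = "<description>(10&#4;10)</description>"
print(find_parse_error(bad))
print(find_parse_error("<description>(10 10)</description>"))  # None
```

Knowing the line and column makes it easier to inspect the offending text in a large file before deciding how to repair it.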

The solution is to first fix your data to be XML before trying to process it as XML. Do this manually, or try to do it programmatically by processing the document at the character/string level.
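
As a rough illustration of the programmatic route, the sketch below removes numeric character references whose code points XML 1.0 does not permit, then parses the cleaned string. The helper names (`strip_invalid_char_refs`, `parse_lenient`), the `record` wrapper element, and the sample text are assumptions made for illustration. It only handles complete references of the form `&#N;` or `&#xN;`, so other kinds of malformed input still need their own fix.

```python
import re
import xml.etree.ElementTree as ET

# Matches numeric character references such as &#4; or &#x1F;
_CHAR_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")

def _is_valid_xml_char(cp):
    # Code point ranges allowed by the XML 1.0 "Char" production.
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )

def strip_invalid_char_refs(text, replacement=""):
    """Replace references to disallowed characters; keep valid ones untouched."""
    def fix(match):
        hex_digits, dec_digits = match.groups()
        cp = int(hex_digits, 16) if hex_digits else int(dec_digits)
        return match.group(0) if _is_valid_xml_char(cp) else replacement
    return _CHAR_REF.sub(fix, text)

def parse_lenient(xml_text):
    """Clean the string first, then hand it to ElementTree."""
    return ET.fromstring(strip_invalid_char_refs(xml_text))

# Assumed sample: the dc prefix is declared so the namespace itself is valid.
DC = "http://purl.org/dc/elements/1.1/"
sample = (
    '<record xmlns:dc="' + DC + '">'
    "<dc:description>(10&#4;10) &#65;</dc:description>"
    "</record>"
)
root = parse_lenient(sample)
print(root.find("{" + DC + "}description").text)
```

Dropping the reference silently changes the data, so passing a visible `replacement` (for example `"?"`) may be the safer choice when the lost character could matter.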

See also "How to parse invalid (bad / not well-formed) XML?"
