
Why are non-ASCII characters escaped in attribute-values after writing an XML-file with lxml?

I'm trying to build an XML file incrementally in Python, using etree.xmlfile from lxml.

My input is an XML file that has umlauts in some attribute values. I read it in with lxml, rename some of the attributes, and then write the result to a new file.

This is my code, broken down:

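from lxml import etree

# path_to_new_file, path_to_original_file and transform_element are defined elsewhere.
with etree.xmlfile(path_to_new_file, encoding="utf8") as xf:
    with xf.element("corpus"):
        for _, element in etree.iterparse(path_to_original_file, tag="comment"):
            new_element = transform_element(element)  # renames some attributes
            xf.write(new_element)
            del element
            del new_element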

In the original file, I might have an element like this:

<comment title="Kübel">Some text with umlauts like this üä</comment>

But after processing, the same comment in the new file looks like this:

<comment title="K&#xFC;bel">Some text with umlauts like this üä</comment>

Do you have any idea what might cause this?

ü does not have to be escaped in an XML attribute value (or in a text node child of an element).
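To illustrate that the two forms are equivalent to an XML parser, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree rather than lxml so that it runs without extra dependencies, and the helper name title_of is made up for this example.

```python
import xml.etree.ElementTree as ET

# The attribute as lxml wrote it (numeric character reference)...
escaped = '<comment title="K&#xFC;bel">Some text</comment>'
# ...and as it appeared in the original file (literal character).
literal = '<comment title="Kübel">Some text</comment>'


def title_of(xml_text: str) -> str:
    """Parse a small XML snippet and return the value of its title attribute."""
    return ET.fromstring(xml_text).get("title")


# Both spellings decode to the same attribute value after parsing.
same_title = title_of(escaped) == title_of(literal)
print(title_of(escaped), same_title)
```

So the escaped output is still well-formed and carries the same data; the difference is only in how the file looks on disk.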

The library's developer was probably being overly cautious and called a generic string-escaping function, possibly to reuse its escaping of < (and &), which always have to be escaped, and of ' or ", which have to be escaped only when they match the quotation mark that delimits the attribute value.
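As a rough illustration of those minimal escaping rules, the sketch below serializes an element by hand with the standard library's xml.sax.saxutils helpers. This is not how lxml writes its output; render_comment is a hypothetical helper used only for this example.

```python
from xml.sax.saxutils import escape, quoteattr


def render_comment(title: str, text: str) -> str:
    """Serialize one <comment> element, escaping only what XML requires."""
    # quoteattr escapes &, < and > and picks a quote character that does not
    # clash with quotes inside the value; escape does the same for text nodes.
    return f"<comment title={quoteattr(title)}>{escape(text)}</comment>"


# Non-ASCII characters pass through untouched.
plain = render_comment("Kübel", "üä")
# Markup characters are escaped; the inner double quotes force single-quote delimiters.
tricky = render_comment('a<b & "c"', "1 < 2")
print(plain)
print(tricky)
```

With this approach the umlauts stay literal, and it is up to the file encoding (for example UTF-8) to represent them.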

For a concise presentation of the precise escaping requirements, see Simplified XML Escaping.
