Why are non-ASCII characters escaped in attribute-values after writing an XML-file with lxml?

Question

I'm trying to build an XML file incrementally in Python, using etree.xmlfile from lxml.

My input is an XML file with umlauts in attribute values. I read it in with lxml, make some changes to the names of the attributes, and then write the result to a new file.

This is my code, broken down:

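from lxml import etree

# transform_element, path_to_original_file and path_to_new_file are defined elsewhere.
with etree.xmlfile(path_to_new_file, encoding="utf8") as xf:
    with xf.element("corpus"):
        # Stream over the <comment> elements of the input one at a time.
        for _, element in etree.iterparse(path_to_original_file, tag="comment"):
            new_element = transform_element(element)
            xf.write(new_element)
            # Drop the references so the elements can be garbage-collected.
            del element
            del new_element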

In the original file, I might have an element like this:

<comment title="Kübel">Some text with umlauts like this üä</comment>

But after processing, the same comment in the new file looks like this:

<comment title="K&#xFC;bel">Some text with umlauts like this üä</comment>

Do you have any idea what might cause this?

Answer 1

ü does not have to be escaped in an XML attribute value (or in a text node child of an element).
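As a quick check that the escaped and the literal form are equivalent, the following sketch parses both and compares the resulting attribute values. It uses the standard library's xml.etree.ElementTree rather than lxml, and the sample strings (including the &#xFC; character reference) are illustrative assumptions, not taken from the actual files.

```python
import xml.etree.ElementTree as ET

# Same element twice: once with a numeric character reference in the
# attribute value, once with the literal umlaut.
escaped = ET.fromstring('<comment title="K&#xFC;bel">text üä</comment>')
literal = ET.fromstring('<comment title="Kübel">text üä</comment>')

# After parsing, both attribute values are the same Python string.
same_title = escaped.get("title") == literal.get("title")
print(same_title)
```

Any conforming XML parser should treat the two documents as carrying the same data, so the escaping changes how the file looks but not what it means.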

Probably the developer of the library was being overly cautious and called a generic string-escaping function, possibly to leverage its escaping of < and &, which always have to be escaped, and of ' or ", which have to be escaped when they match the quotation mark delimiting the attribute value.

For precise escaping requirements concisely presented, see Simplified XML Escaping.
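To illustrate those requirements, here is a minimal sketch using the standard library's xml.sax.saxutils helpers (not lxml, and not necessarily what lxml does internally). The sample strings are made up. Note that escape is slightly more conservative than strictly required, since it also escapes >.

```python
from xml.sax.saxutils import escape, quoteattr

# Text content: markup characters are escaped, the umlaut is left alone.
text = escape("Kübel <b> & co")

# Attribute value: quoteattr picks a quote character that does not clash
# with the quotes inside the value, so the inner " needs no escaping here.
attr = quoteattr('Kübel & "Eimer"')

print(text)
print(attr)
```

In both results the ü stays as a literal character; only the markup-significant characters are replaced.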
