How to prevent lxml from converting '&' character to '&amp;'?

I need to send the control characters &#x0D; and &#x0A; in my XML file so that the text is displayed correctly in the target system.

For the creation of the XML file I use the lxml library. This is my attempt:

from lxml import etree as et
import lxml.builder

e = lxml.builder.ElementMaker()

xml_doc = e.newOrderRequest(
    e.Orders(
        e.Order(
            e.OrderNumber('12345'),
            e.OrderID('001'),
            e.Articles(
                e.Article(
                    e.ArticleNumber('000111'),
                    e.ArticleName('Logitec Mouse'),
                    e.ArticleDescription('* 4 Buttons&#x0D;&#x0A;* 600 DPI&#x0D;&#x0A;* Bluetooth')
                )
            )
        )
    )
)

tree = et.ElementTree(xml_doc)
tree.write('output.xml', pretty_print=True, xml_declaration=True, encoding="utf-8")

This is the result:

<?xml version='1.0' encoding='UTF-8'?>
<newOrderRequest>
  <Orders>
    <Order>
      <OrderNumber>12345</OrderNumber>
      <OrderID>001</OrderID>
      <Articles>
        <Article>
          <ArticleNumber>000111</ArticleNumber>
          <ArticleName>Logitec Mouse</ArticleName>
          <ArticleDescription>* 4 Buttons&amp;#x0D;&amp;#x0A;* 600 DPI&amp;#x0D;&amp;#x0A;* Bluetooth</ArticleDescription>
        </Article>
      </Articles>
    </Order>
  </Orders>
</newOrderRequest>

This is what I need:

<ArticleDescription>* 4 Buttons&#x0D;&#x0A;* 600 DPI&#x0D;&#x0A;* Bluetooth</ArticleDescription>

Is there a function in the lxml library to turn off the conversion or does anyone know a way to solve this problem? Thanks in advance.

This is not a Python or lxml issue - it is how XML parsers and serializers work. If you want a specific character in the element text, put that character itself into the string in your programming language. The serializer will convert it into an entity or character reference if required, and the parser will convert it back when reading the document. You cannot turn this off - doing so would be against the specification.
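As a rough illustration of that round trip (a minimal sketch; the element name is borrowed from the question and the sample text is made up for the example), the following shows that a Python string containing the characters &#x0D; is treated as literal text, so the serializer escapes its ampersand and the parser restores it:

```python
from lxml import etree as et

el = et.Element('ArticleDescription')
# In Python source this is just six literal characters, not a character reference.
el.text = 'A &#x0D; B'

# Serializing escapes the ampersand so the literal text survives in XML form.
serialized = et.tostring(el)
print(serialized)

# Parsing the serialized form gives back exactly the original Python string.
roundtrip = et.fromstring(serialized).text
print(repr(roundtrip))
```

The escaping is what keeps the two directions symmetric: whatever string goes in is the string a conforming parser gets back out.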

An exception might be to use a CDATA section, as explained in What does <![CDATA[]]> in XML mean?
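A minimal sketch of the CDATA route, assuming lxml's etree.CDATA wrapper is acceptable for your case. Note that inside a CDATA section the text &#x0D;&#x0A; is not a character reference: a conforming consumer sees those literal characters, so this only helps if the target system decodes them itself.

```python
from lxml import etree as et

el = et.Element('ArticleDescription')
# Wrap the text in a CDATA section so the serializer leaves '&' unescaped.
el.text = et.CDATA('* 4 Buttons&#x0D;&#x0A;* 600 DPI')

cdata_xml = et.tostring(el)
print(cdata_xml)

# A parser returns the literal characters; no carriage return is produced.
cdata_text = et.fromstring(cdata_xml).text
print(repr(cdata_text))
```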

The output of the Python script:

import lxml.etree as et
print(repr(et.fromstring('''<ArticleDescription>* 4 Buttons&#x0D;&#x0A;* 600 DPI&#x0D;&#x0A;* Bluetooth</ArticleDescription>''').text))

...is...

'* 4 Buttons\r\n* 600 DPI\r\n* Bluetooth'

That means that the Python-syntax way to write the XML-syntax string * 4 Buttons&#x0D;&#x0A;* 600 DPI&#x0D;&#x0A;* Bluetooth is '* 4 Buttons\r\n* 600 DPI\r\n* Bluetooth'.

Thus, the relevant line of code should be:

e.ArticleDescription('* 4 Buttons\r\n* 600 DPI\r\n* Bluetooth')

...and if the consumer doesn't treat the resulting output as exactly identical to <ArticleDescription>* 4 Buttons&#x0D;&#x0A;* 600 DPI&#x0D;&#x0A;* Bluetooth</ArticleDescription>, that consumer is broken.

See https://replit.com/@CharlesDuffy2/ImportantClassicConversion#test.py for your code running with the modification suggested above.
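For reference, here is a sketch of the question's script with that change applied. It serializes to a bytes object with et.tostring instead of writing output.xml, and the description and parsed_description variables are added only for this illustration; the final lines parse the result back to check that the description survives unchanged.

```python
from lxml import etree as et
import lxml.builder

e = lxml.builder.ElementMaker()

# Real carriage-return and line-feed characters, written with Python escapes.
description = '* 4 Buttons\r\n* 600 DPI\r\n* Bluetooth'

xml_doc = e.newOrderRequest(
    e.Orders(
        e.Order(
            e.OrderNumber('12345'),
            e.OrderID('001'),
            e.Articles(
                e.Article(
                    e.ArticleNumber('000111'),
                    e.ArticleName('Logitec Mouse'),
                    e.ArticleDescription(description)
                )
            )
        )
    )
)

# Serialize in memory; the serializer decides how to represent the characters.
output = et.tostring(xml_doc, pretty_print=True, xml_declaration=True, encoding='utf-8')
print(output.decode('utf-8'))

# Parse the result again and read the description back.
parsed_description = et.fromstring(output).findtext('.//ArticleDescription')
print(repr(parsed_description))
```

The serializer may write the carriage return in a different but equivalent form than &#x0D; (for example as a decimal character reference); a conforming parser treats these the same.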
