How to prevent python BeautifulSoup from replacing escape sequences with hex codes?

I am trying to use BeautifulSoup in a Python script to avoid manual work when mass-updating IBM IDA (InfoSphere Data Architect) LDM (Logical Data Model) files, which are actually XML. It works fine except for one side effect. The description attribute in the XML can contain formatting with control characters encoded as character references such as &#xD;, &#xA; and &#x9;. In my script's output they are converted to the raw characters (hex 0D, 0A, 09). I do not know how to avoid this. To illustrate the effect, I simplified my script so that it just reads the model and writes it out to another file.

from bs4 import BeautifulSoup

source_model_file_name = "TestModel.ldm"
target_model_file_name = "TestModel_out.ldm"

# Read the whole model into memory.
with open(source_model_file_name, "r", encoding="utf-8", newline="\r\n") as source_model_file:
    source_model = source_model_file.read()

# Parse the model as XML (the "xml" parser requires lxml).
soup_model = BeautifulSoup(source_model, "xml")

# Write the parsed tree back out without modifying it.
with open(target_model_file_name, "w", encoding="utf-8", newline="\r\n") as file:
    file.write(str(soup_model))
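
The references are decoded when the document is parsed, not when it is written: an XML parser turns &#xD;, &#xA; and &#x9; into the CR, LF and TAB characters, so the attribute value in the tree no longer contains them. A minimal sketch of that decoding, using the standard library's xml.etree.ElementTree instead of BeautifulSoup only because it needs no extra install (the sample element is made up for illustration):

```python
import xml.etree.ElementTree as ET

# A one-element document whose attribute uses the three character references.
xml_doc = '<data description="Some Text &#xD; &#xA; &#x9;">some data</data>'

element = ET.fromstring(xml_doc)
description = element.get("description")

# repr() makes the decoded control characters visible.
print(repr(description))  # 'Some Text \r \n \t'
```

Because the original spelling is lost at parse time, it has to be restored explicitly when the tree is serialized.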

One solution is to use a custom formatter that turns the control characters in attribute values back into character references on output:

from bs4 import BeautifulSoup
from bs4.formatter import HTMLFormatter


class CustomAttributes(HTMLFormatter):
    def attributes(self, tag):
        for k, v in tag.attrs.items():
            v = v.replace("\r", "&#xD;")
            v = v.replace("\n", "&#xA;")
            v = v.replace("\t", "&#x9;")
            yield k, v


xml_doc = """<test>
    <data description="Some Text &#xD; &#xA; &#x9;">
        some data
    </data>
</test>"""

soup = BeautifulSoup(xml_doc, "xml")

print(soup.prettify(formatter=CustomAttributes()))

Prints:

<?xml version="1.0" encoding="utf-8"?>
<test>
 <data description="Some Text &#xD; &#xA; &#x9;">
  some data
 </data>
</test>
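
The formatter above is created without an entity substitution function, which is why its &#xD; strings pass through untouched, but it also means that & and < in text or attribute values are written unescaped. A possible variant, offered as a sketch rather than a tested drop-in for IDA models, keeps the standard XML escaping and re-encodes the control characters afterwards by overriding attribute_value. The names CONTROL_CHAR_REFS, ControlCharFormatter and rewrite_model are hypothetical, and the code assumes bs4 4.8 or later with lxml installed:

```python
from bs4 import BeautifulSoup
from bs4.dammit import EntitySubstitution
from bs4.formatter import XMLFormatter

# Control characters to write back as numeric character references.
CONTROL_CHAR_REFS = {"\r": "&#xD;", "\n": "&#xA;", "\t": "&#x9;"}


class ControlCharFormatter(XMLFormatter):
    """Escape XML as usual, then re-encode CR, LF and TAB in attribute values."""

    def __init__(self):
        # Keep the normal escaping of &, < and > for text and attributes.
        super().__init__(entity_substitution=EntitySubstitution.substitute_xml)

    def attribute_value(self, value):
        # Escape first, so the "&" of the references added below is not escaped again.
        value = super().attribute_value(value)
        for char, ref in CONTROL_CHAR_REFS.items():
            value = value.replace(char, ref)
        return value


def rewrite_model(xml_text):
    """Parse xml_text and serialize it without pretty-printing."""
    soup = BeautifulSoup(xml_text, "xml")
    return soup.decode(formatter=ControlCharFormatter())


sample = '<test><data description="a &amp; b&#xD;&#xA;&#x9;c">x &lt; y</data></test>'
print(rewrite_model(sample))
```

In the script from the question, this would replace str(soup_model) with rewrite_model(source_model). Unlike prettify(), decode() does not re-indent the document, which may matter when the output is compared with the original file.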
