How to prevent Python BeautifulSoup from replacing escape sequences with hex codes?

Question

I am trying to use BeautifulSoup in a Python script that saves me manual work when mass-updating IBM IDA (InfoSphere Data Architect) LDM (Logical Data Model) files, which are actually XML. It works fine except for one side effect. The description attribute in the XML can contain formatting with control characters encoded as escape sequences (character references) such as &#xD;, &#xA;, and &#x9;. In my script's output they are converted to the raw characters (hex 0D, 0A, 09). I do not know how to avoid this. To illustrate the effect, I simplified my script so that it just reads the model and writes it out to another file.

from bs4 import BeautifulSoup

source_model_file_name = "TestModel.ldm"
target_model_file_name = "TestModel_out.ldm"

# Read the model (XML) as text.
with open(source_model_file_name, "r", encoding="utf-8", newline="\r\n") as source_model_file:
    source_model = source_model_file.read()

# Parse with the XML parser; character references are resolved here.
soup_model = BeautifulSoup(source_model, "xml")

# Write the parsed document back out without making any changes.
with open(target_model_file_name, "w", encoding="utf-8", newline="\r\n") as target_model_file:
    target_model_file.write(str(soup_model))

Answer 1

One solution is to use a custom formatter that re-encodes the control characters in attribute values when the document is written out:

from bs4 import BeautifulSoup
from bs4.formatter import HTMLFormatter


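class CustomAttributes(HTMLFormatter):
    # Beautiful Soup calls attributes() for every tag it serializes.
    def attributes(self, tag):
        for k, v in tag.attrs.items():
            # Write the control characters back as character references.
            v = v.replace("\r", "&#xD;")
            v = v.replace("\n", "&#xA;")
            v = v.replace("\t", "&#x9;")
            yield k, v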


xml_doc = """<test>
    <data description="Some Text &#xD; &#xA; &#x9;">
        some data
    </data>
</test>"""

soup = BeautifulSoup(xml_doc, "xml")

print(soup.prettify(formatter=CustomAttributes()))

Prints:

<?xml version="1.0" encoding="utf-8"?>
<test>
 <data description="Some Text &#xD; &#xA; &#x9;">
  some data
 </data>
</test>
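
The three chained replace calls can also be folded into a single translation table, which may be easier to extend if other control characters turn up later. This is only a sketch: the names encode_control_chars and _CONTROL_CHAR_REFS are made up for illustration and are not part of Beautiful Soup.

```python
# Hypothetical helper (not part of Beautiful Soup): maps the control
# characters from the question to XML numeric character references.
_CONTROL_CHAR_REFS = str.maketrans({
    "\r": "&#xD;",
    "\n": "&#xA;",
    "\t": "&#x9;",
})


def encode_control_chars(value: str) -> str:
    """Return value with CR, LF and TAB written as character references."""
    return value.translate(_CONTROL_CHAR_REFS)


if __name__ == "__main__":
    # Same attribute value as in the example document, after parsing.
    print(encode_control_chars("Some Text \r \n \t"))
```

Inside attributes, the loop body would then reduce to yield k, encode_control_chars(v). Note also that prettify() reindents the document; if the original layout should be kept, soup.decode(formatter=CustomAttributes()) is probably a closer match to the str(soup_model) call in the question.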