
Fixing HTML Tags within XML files using Python

I have been given .htm files that are structured as XML and have HTML entities within them. The issue is that a lot of these entities have been converted to literal characters along the way. For example, &lt; has been converted to <, &amp; has been converted to &, and so on. Is there a Python module that is able to fix these HTML entities, something like an HTML Corrector?

For example:

<Employee>
  <name> Adam</name
  <age> > 24 </age>
  <Nicknames> A & B </Nicknames>
</Employee>

In the example above, the > in age should be converted to '&gt;' and the & should be converted to '&amp;'.

Desired result:

<Employee>
  <name> Adam</name
  <age> &gt; 24 </age>
  <Nicknames> A &amp; B </Nicknames>
</Employee>

If the HTML is well-formed, you can just convert it to a BeautifulSoup object (from beautifulsoup4) and the inner text of each tag will be escaped:

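from bs4 import BeautifulSoup

my_html = \
"""<Employee>
<name> Adam</name>
<age> > 24 </age>
<Nicknames> A & B </Nicknames>
</Employee>"""

# Name the parser explicitly; html.parser ships with Python.
soup = BeautifulSoup(my_html, "html.parser")
print(soup)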

Output:

<employee>
<name> Adam</name>
<age> &gt; 24 </age>
<nicknames> A &amp; B </nicknames>
</employee>
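
One thing to note in the output above: the tag names come back lowercased (<employee> rather than <Employee>), which is how html.parser treats tags. If the original case matters, a standard-library-only alternative might be to split the markup on anything that looks like a tag and escape only the text in between. The sketch below is a rough illustration rather than part of the answer; TAG_RE and escape_text_nodes are hypothetical names, and it assumes the tags themselves are intact and that the text never contains something shaped like a tag.

```python
import html
import re

# Hypothetical pattern: anything shaped like <tag ...> or </tag>.
TAG_RE = re.compile(r"(</?[A-Za-z][^<>]*>)")

def escape_text_nodes(markup: str) -> str:
    """Escape &, < and > in the text between tags, leaving tags untouched."""
    # With one capturing group, re.split alternates text, tag, text, tag, ...
    parts = TAG_RE.split(markup)
    return "".join(
        # Odd indices are tags. Text is unescaped first so that entities
        # that are already correct are not escaped a second time.
        part if i % 2 else html.escape(html.unescape(part), quote=False)
        for i, part in enumerate(parts)
    )

my_html = """<Employee>
<name> Adam</name>
<age> > 24 </age>
<Nicknames> A & B </Nicknames>
</Employee>"""

print(escape_text_nodes(my_html))
```

Because the text is unescaped before it is escaped again, running the function on markup that is already correct should leave it unchanged. A broken tag such as </name without its > is not recognised as a tag here and would be escaped as text, so it has to be repaired first.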

Not sure if this was intentional, but the exact example you provided includes a broken tag: </name without the closing >. You'd need to fix this first, which is trickier; you could maybe use a regular expression. This gets the correct output for your example:

import re
from bs4 import BeautifulSoup

my_html = \
"""<Employee>
<name> Adam</name
<age> > 24 </age>
<Nicknames> A & B </Nicknames>
</Employee>"""

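# Append the missing ">" to a closing tag that is followed by whitespace.
# The replacement must keep the "/" so the tag stays a closing tag.
my_html = re.sub(r"</([^>]*)(\s)", r"</\1>\2", my_html)
soup = BeautifulSoup(my_html, "html.parser")
print(soup)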
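
The pattern </([^>]*)(\s) is tuned to this one example. It would also match a closing tag that is already complete but has whitespace before its >, such as </age >, and insert a second >. A slightly stricter variant could look only for closing tags whose name is not followed by optional whitespace and a >. This is a sketch under that assumption; UNCLOSED_END_TAG and close_end_tags are hypothetical names, not something from the answer.

```python
import re

# Hypothetical pattern: "</" plus a complete tag name that is NOT followed
# by optional whitespace and ">", i.e. a closing tag missing its ">".
UNCLOSED_END_TAG = re.compile(r"</([A-Za-z][\w:.-]*)(?![\w:.-])(?!\s*>)")

def close_end_tags(markup: str) -> str:
    """Add the missing ">" to closing tags such as "</name"."""
    return UNCLOSED_END_TAG.sub(r"</\1>", markup)

broken = "<name> Adam</name\n<age> > 24 </age >"
# "</name" gains its ">", while "</age >" is already closed and is left alone.
print(close_end_tags(broken))
```

The repaired string can then be passed to BeautifulSoup as before.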
