BeautifulSoup: remove special characters from the contents of a tag
I want to compare a string with the contents of an HTML page, but the special characters in the page make this comparison harder. So I want to remove all special characters and whitespace from the HTML page before comparing. All the tags must remain the same, though. That is:
<div class="abc bcd">
<div class="inner1"> Hai ! this is first inner div;</div>
<div class="inner2"> "this is second div... " </div>
</div>
This should be converted to:
<div class="abc bcd">
<div class="inner1">Haithisisfirstinnerdiv</div>
<div class="inner2">thisisseconddiv</div>
</div>
How can this be done?
Find all the leaf tags (tags whose only child is a string) and replace their strings, keeping only the letters.
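from bs4 import BeautifulSoup, Tag

# Whitelist of characters to keep (ASCII letters only).
alphabet = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

def replace(soup):
    for child in soup.children:
        if not isinstance(child, Tag):
            continue  # skip bare text nodes, e.g. the newlines between tags
        if child.string:
            # Leaf tag: rebuild its string from the whitelisted characters.
            child.string = ''.join(ch for ch in child.string if ch in alphabet)
        else:
            replace(child)

orig_string = """
<div class="abc bcd">
<div class="inner1"> Hai ! this is first inner div;</div>
<div class="inner2"> "this is second div... " </div>
</div> """
# 'lxml' wraps the fragment in <html><body>, as in the output below.
soup = BeautifulSoup(orig_string, 'lxml')
print(soup.prettify())  # original HTML
replace(soup)
print()
print(soup.prettify())  # new HTML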
Output:
<html>
 <body>
  <div class="abc bcd">
   <div class="inner1">
    Hai ! this is first inner div;
   </div>
   <div class="inner2">
    "this is second div... "
   </div>
  </div>
 </body>
</html>

<html>
 <body>
  <div class="abc bcd">
   <div class="inner1">
    Haithisisfirstinnerdiv
   </div>
   <div class="inner2">
    thisisseconddiv
   </div>
  </div>
 </body>
</html>
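A variation on the same idea, as a rough sketch rather than a drop-in replacement: instead of looking for leaf tags, rewrite every text node directly. This should also cover tags that mix text with child tags, which a leaf-only pass leaves alone. The helper name `strip_text_nodes` and the `NON_LETTERS` pattern are made up for this example, and it assumes a bs4 version that accepts the `string=` argument to `find_all`. Note that it touches every text node, including comments and the contents of script or style tags.

```python
import re
from bs4 import BeautifulSoup

# Anything that is not an ASCII letter gets removed (assumed whitelist).
NON_LETTERS = re.compile(r'[^A-Za-z]')

def strip_text_nodes(soup):
    """Rewrite every text node in place, keeping only ASCII letters."""
    for node in soup.find_all(string=True):
        node.replace_with(NON_LETTERS.sub('', node))
    return soup

html = ('<div class="abc bcd">'
        '<div class="inner1"> Hai ! this is first inner div;</div>'
        '<div class="inner2"> "this is second div... " </div>'
        '</div>')
soup = strip_text_nodes(BeautifulSoup(html, 'html.parser'))
result = str(soup)
print(result)
```

Tags and attributes are left untouched; only the strings between them change.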
First off, BeautifulSoup already fixes up some broken HTML when you call BeautifulSoup(). So:
<div class="inner1">
becomes:
<div class="inner1">
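To see that repair step in isolation, here is a minimal sketch. The malformed input is invented for illustration (it is not taken from the question), and it uses the built-in 'html.parser'. Other parsers such as lxml or html5lib may repair the same input differently, for example by adding html and body wrappers.

```python
from bs4 import BeautifulSoup

# Hypothetical malformed input: neither <div> is ever closed.
broken = '<div class="abc bcd"><div class="inner1">Hai'

soup = BeautifulSoup(broken, 'html.parser')
repaired = str(soup)  # serializing the tree emits the missing closing tags
print(repaired)
```

So whatever you compare against is the parsed tree, not necessarily the exact markup you fed in.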
Here's how to get rid of the whitespace and special characters:
>>> from bs4 import BeautifulSoup
>>> html = """<div class="abc bcd">
... <div class="inner1"> Hai ! this is first inner div;</div>
... <div class="inner2"> "this is second div... " </div>
... </div>"""
>>> soup = BeautifulSoup(html, 'lxml')
>>> for divtag in soup.find_all('div'):
...     if 'inner' in divtag['class'][0]:
...         # keep only letters and digits; drops whitespace and punctuation
...         divtag.string = ''.join(i for i in divtag.string if i.isalnum())
...
>>> print(soup)
<html><body><div class="abc bcd">
<div class="inner1">Haithisisfirstinnerdiv</div>
<div class="inner2">thisisseconddiv</div>
</div></body></html>
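Since the end goal is comparing a string against the page, one option is to leave the document alone and normalize both sides only at comparison time. The following is a sketch under that assumption; `normalize` and `find_matching_tags` are hypothetical helper names, not part of bs4. Keep in mind that `str.isalnum()` in Python 3 is Unicode-aware, so accented letters and non-ASCII digits are kept.

```python
from bs4 import BeautifulSoup

def normalize(text):
    """Keep only letters and digits, like the isalnum() filter above."""
    return ''.join(ch for ch in text if ch.isalnum())

def find_matching_tags(html, target):
    """Return the tags whose normalized text equals the normalized target."""
    soup = BeautifulSoup(html, 'html.parser')
    wanted = normalize(target)
    return [tag for tag in soup.find_all(True)
            if normalize(tag.get_text()) == wanted]

html = """<div class="abc bcd">
<div class="inner1"> Hai ! this is first inner div;</div>
<div class="inner2"> "this is second div... " </div>
</div>"""

matches = find_matching_tags(html, 'this is second div')
print([tag['class'] for tag in matches])
```

Here only the inner2 div matches; the outer div does not, because its text includes both inner strings.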