BeautifulSoup: removing special characters from tag contents

Question

I want to compare a string with the contents of an HTML page, but the special characters in the page make the comparison harder. So, before comparing, I want to remove all special characters and whitespace from the page's text, while every tag must remain unchanged. That is:

<div class="abc bcd">
         <div  class="inner1"> Hai ! this is first inner div;</div>
         <div class="inner2"> "this is second div... " </div>
</div>

This should be converted to:

<div class="abc bcd">
          <div class="inner1">Haithisisfirstinnerdiv</div>
          <div class="inner2">thisisseconddiv</div>
</div>

How can this be done?

Answer 1

Find all the leaf tags and change their strings.

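import string

from bs4 import BeautifulSoup, Tag

# ASCII letters only; append string.digits if digits should survive too.
alphabet = string.ascii_letters

def replace(soup):
    """Recursively strip every non-letter character from leaf-tag text."""
    for child in soup.children:
        if not isinstance(child, Tag):
            continue  # bare strings (e.g. the newlines between tags) are left alone
        if child.string:
            child.string = ''.join(ch for ch in child.string if ch in alphabet)
        else:
            replace(child)

orig_string = """
<div class="abc bcd">
         <div  class="inner1"> Hai ! this is first inner div;</div>
         <div class="inner2"> "this is second div... " </div>
</div> """

# "lxml" (a separate install) adds the <html>/<body> wrappers seen in the output.
soup = BeautifulSoup(orig_string, "lxml")
print(soup.prettify())  # original HTML
replace(soup)
print()
print(soup.prettify())  # new HTML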

Output:

<html>
 <body>
  <div class="abc bcd">
   <div class="inner1">
    Hai ! this is first inner div;
   </div>
   <div class="inner2">
    "this is second div... "
   </div>
  </div>
 </body>
</html>

<html>
 <body>
  <div class="abc bcd">
   <div class="inner1">
    Haithisisfirstinnerdiv
   </div>
   <div class="inner2">
    thisisseconddiv
   </div>
  </div>
 </body>
</html>
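
One caveat with the recursive approach: a tag whose only child is another tag also reports a `.string`, and assigning to it replaces that inner tag with plain text. The sketch below is one possible way around that, rewriting each text node individually so nested tags survive. The names `strip_text_nodes` and `keep` are made up for this illustration, and `html.parser` is chosen only because it ships with Python (it does not add `<html>`/`<body>` wrappers).

```python
from bs4 import BeautifulSoup, Comment

def strip_text_nodes(soup, keep=str.isalnum):
    """Filter every text node in place; tags and attributes are untouched."""
    # Copy the result into a list first, because replace_with() mutates the tree.
    for node in list(soup.find_all(string=True)):
        if isinstance(node, Comment):
            continue  # leave HTML comments as they are
        node.replace_with(''.join(ch for ch in node if keep(ch)))
    return soup

html = '<div class="abc bcd"><p class="inner1"> Hai ! <b>this is</b> nested;</p></div>'
soup = BeautifulSoup(html, "html.parser")
strip_text_nodes(soup)
print(soup)  # <div class="abc bcd"><p class="inner1">Hai<b>thisis</b>nested</p></div>
```

Passing a different predicate as `keep` (for example `str.isalpha`) changes which characters are retained without touching the traversal.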

Answer 2

First off, BeautifulSoup already tidies up some of the markup when you call BeautifulSoup(), such as extra whitespace inside a tag. So:

<div  class="inner1">

becomes

<div class="inner1">
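
A quick way to check this normalization, sketched with the built-in `html.parser` (my choice here, so no `<html>`/`<body>` wrappers appear). The `extra` class is added only to show that `class` comes back as a list, which is why the loop further down indexes `divtag['class'][0]`.

```python
from bs4 import BeautifulSoup

# Two spaces between "div" and "class" in the input.
soup = BeautifulSoup('<div  class="inner1 extra">text</div>', "html.parser")

print(soup)               # <div class="inner1 extra">text</div>
print(soup.div["class"])  # ['inner1', 'extra']
```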

Here's how to get rid of the whitespace and special characters:

>>> from bs4 import BeautifulSoup
>>> html = """<div class="abc bcd">
...      <div  class="inner1"> Hai ! this is first inner div;</div>
...      <div class="inner2"> "this is second div... " </div>
... </div>"""
>>> soup = BeautifulSoup(html, "lxml")  # lxml adds the <html>/<body> wrappers shown below
>>> for divtag in soup.find_all('div'):
...     if 'inner' in divtag['class'][0]:
...         divtag.string = ''.join(i for i in divtag.string if i.isalnum())
...
>>> print(soup)
<html><body><div class="abc bcd">
<div class="inner1">Haithisisfirstinnerdiv</div>
<div class="inner2">thisisseconddiv</div>
</div></body></html>
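
Since the goal in the question is a comparison, the string being searched for needs the same treatment as the page text. A minimal sketch using only the standard library follows; `normalize`, `contains`, and the `ignore_case` flag are hypothetical helpers for illustration, not part of BeautifulSoup.

```python
def normalize(text, ignore_case=False):
    """Keep only letters and digits, optionally folding case."""
    kept = ''.join(ch for ch in text if ch.isalnum())
    return kept.casefold() if ignore_case else kept

def contains(haystack, needle, ignore_case=False):
    """True if the normalized needle occurs in the normalized haystack."""
    return normalize(needle, ignore_case) in normalize(haystack, ignore_case)

page_text = ' "this is second div... " '  # e.g. the text of one tag
print(contains(page_text, "this is second div"))                    # True
print(contains(page_text, "This Is Second Div"))                    # False
print(contains(page_text, "This Is Second Div", ignore_case=True))  # True
```

Normalizing both sides at comparison time also means the parsed tree itself can be left unmodified if only the match result is needed.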

2 answers

Answer 1
1 2013-05-12 04:26:38

Answer 2
0 2013-05-12 04:32:59