
How to remove any HTML tags within a specific pattern in BeautifulSoup

<p> A <span>die</span> is thrown \(x = {-b \pm <span>\sqrt</span> {b^2-4ac} \over 2a}\) twice. What is the probability of getting a sum 7 from both the throws? </p>

In the HTML above, I only need to remove the tags that appear inside "\(tags\)", i.e. inside \(x = {-b \pm <span>\sqrt</span> {b^2-4ac} \over 2a}\). I have just started using BeautifulSoup. Is there any way to achieve this with BeautifulSoup?

I came up with a solution to my own problem. I hope it helps others. Please feel free to suggest improvements to the code.

from bs4 import BeautifulSoup, Tag
import re

# Raw string, so that "\(", "\pm" and "\sqrt" keep their literal backslashes.
html = r"""<p>
     A 
     <span>die</span> 
      is thrown \(x = {-b \pm 
      <span>\sqrt</span>
      {b^2-4ac} \over 2a}\) twice. What is the probability of getting a sum 7 from
    both the throws?
    </p> <p> Test </p>"""

soup = BeautifulSoup(html, 'html.parser')
mathml_start_regex = re.compile(r'\\\(')
mathml_end_regex = re.compile(r'\\\)')

for p_tags in soup.find_all('p'):
    match = 0  # Flag: set to 1 once '\(' is found, reset to 0 once '\)' is found.
    # Iterate over a copy, because replace_with() modifies the list of children.
    for p_child in list(p_tags.children):
        # A child is either a Tag or a NavigableString; get its text either way.
        text = p_child.get_text() if isinstance(p_child, Tag) else str(p_child)
        if mathml_start_regex.search(text):
            match = 1
        # Only Tags need replacing; NavigableStrings are already plain text.
        if match == 1 and isinstance(p_child, Tag):
            p_child.replace_with(text)
        if mathml_end_regex.search(text):
            match = 0

print(soup)

Output:

<p>
     A 
     <span>die</span> 
      is thrown \(x = {-b \pm 
      \sqrt
      {b^2-4ac} \over 2a}\) twice. What is the probability of getting a sum 7 from
    both the throws?
    </p> <p> Test </p>

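In the code above, I search for all 'p' tags, which returns a bs4.element.ResultSet. In the first for loop, I iterate over the result set to get the individual 'p' tags. In the second for loop, I use the .children generator to iterate over each 'p' tag's children, which are a mix of navigable strings and tags. Each child is searched for '\('; if it is found, match is set to 1. While match is 1, any tag among the children is replaced by its text using replace_with. Finally, match is reset to zero when '\)' is found.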
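To make the "navigable strings and tags" point concrete, here is a minimal sketch that lists the types `.children` yields for a shortened version of the paragraph. The shortened HTML string and the variable name `kinds` are illustrative assumptions, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the paragraph from the question.
soup = BeautifulSoup(r"<p>A <span>die</span> is thrown \(x\)</p>", "html.parser")

# Each child of <p> is either a NavigableString (bare text) or a Tag.
kinds = [type(child).__name__ for child in soup.p.children]
print(kinds)  # ['NavigableString', 'Tag', 'NavigableString']
```

Because the two types behave differently, the loop has to tell them apart before calling replace_with on a child.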

You can't get the substring with BeautifulSoup alone. You can use a regular expression together with it.

from bs4 import BeautifulSoup
import re

# Raw string, so that "\(", "\pm" and "\sqrt" keep their literal backslashes.
html = r"""<p>
     A 
     <span>die</span> 
      is thrown \(x = {-b \pm 
      <span>\sqrt</span>
      {b^2-4ac} \over 2a}\) twice. What is the probability of getting a sum 7 from
    both the throws?
    </p>"""

soup = BeautifulSoup(html, 'html.parser')

# soup.text is the document with all tags dropped; DOTALL lets '.' match newlines.
print(re.findall(r'\\\(.*?\)', soup.text, re.DOTALL))

Output:

['\\(x = {-b \\pm \n      \\sqrt\n      {b^2-4ac} \\over 2a}\\)']

Regex:

\\\(.*?\) - Get the shortest substring that starts at \( and runs up to the first ).
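One caveat with this pattern: it stops at the first `)`, so a formula that itself contains parentheses would be cut short. The sketch below compares it with a stricter variant that also requires the backslash before the closing parenthesis. The sample string and the names `loose` and `strict` are assumptions made for illustration.

```python
import re

# Hypothetical formula that contains parentheses of its own.
text = r"Expand \(f(x) = (a+b)^2\) fully."

loose = re.findall(r'\\\(.*?\)', text, re.DOTALL)     # pattern from the answer
strict = re.findall(r'\\\(.*?\\\)', text, re.DOTALL)  # requires '\' before ')'

print(loose)   # match is cut off at the first ')'
print(strict)  # match runs to the closing '\)'
```

For the formula in the question both patterns return the same result, since its only `)` is the closing delimiter.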

If you want to remove the newlines and extra spaces, you can do this:

res = re.findall(r'\\\(.*?\)', soup.text, re.DOTALL)[0]
# split() with no arguments splits on any run of whitespace, including newlines.
print(' '.join(res.split()))

Output:

\(x = {-b \pm \sqrt {b^2-4ac} \over 2a}\)
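Indexing with `[0]` raises an IndexError when the text contains no formula, and it ignores any formula after the first. A small sketch that normalizes every match instead is shown here. The helper name `normalized_math_spans` is hypothetical, and it assumes the stricter `\\\(.*?\\\)` pattern rather than the one used in the answer.

```python
import re

# Assumed pattern: lazy match from '\(' to the next '\)'.
MATH_SPAN = re.compile(r'\\\(.*?\\\)', re.DOTALL)

def normalized_math_spans(text):
    # Collapse every run of whitespace (including newlines) to a single space.
    return [' '.join(span.split()) for span in MATH_SPAN.findall(text)]

sample = "is thrown \\(x = {-b \\pm \n  \\sqrt\n  {b^2-4ac} \\over 2a}\\) twice"
print(normalized_math_spans(sample))
print(normalized_math_spans("no formula here"))  # empty list, no exception
```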

To put an HTML wrapper around the string:

# The <html><body><p> wrapper is added by the lxml parser, so lxml must be installed.
print(BeautifulSoup(' '.join(res.split()), 'lxml'))

Output:

<html><body><p>\(x = {-b \pm \sqrt {b^2-4ac} \over 2a}\)</p></body></html>
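The regex approach extracts the formula but leaves the original document unchanged. To strip tags only inside `\( ... \)` and keep the rest of the markup, one possible sketch applies the pattern to the HTML string itself and cleans each match with BeautifulSoup. The function name `strip_tags_in_math` and the stricter `\\\(.*?\\\)` pattern are assumptions, not part of either answer. Note also that get_text() decodes entities such as `&lt;`, which may matter if the formulas contain escaped characters.

```python
import re
from bs4 import BeautifulSoup

# Assumed pattern: lazy match from '\(' to the next '\)'.
MATH_SPAN = re.compile(r'\\\(.*?\\\)', re.DOTALL)

def strip_tags_in_math(html):
    # Return html with markup removed only inside the math delimiters.
    def clean(match):
        # Parse just the matched fragment and keep its text content.
        return BeautifulSoup(match.group(0), 'html.parser').get_text()
    return MATH_SPAN.sub(clean, html)

html = r"<p>A <span>die</span> is thrown \(x = {-b \pm <span>\sqrt</span> {b^2-4ac} \over 2a}\) twice.</p>"
print(strip_tags_in_math(html))
```

The `<span>` around "die" is kept because it lies outside the delimiters, while the one around `\sqrt` is removed.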
