How to remove any HTML tags within a specific pattern in BeautifulSoup

<p> A <span>die</span> is thrown \(x = {-b \pm <span>\sqrt</span> {b^2-4ac} \over 2a}\) twice. What is the probability of getting a sum 7 from both the throws? </p>

In the HTML above, I need to remove only the tags that appear between the "\(" and "\)" delimiters, i.e. inside \(x = {-b \pm <span>\sqrt</span> {b^2-4ac} \over 2a}\). I have just started with BeautifulSoup. Is there any way to achieve this with it?

I came up with a solution to my own question. I hope it helps others. Feel free to suggest improvements to the code.

from bs4 import BeautifulSoup
from bs4.element import Tag
import re

# Raw string, so that \( \pm \sqrt etc. are kept as literal backslashes.
html = r"""<p>
     A 
     <span>die</span> 
      is thrown \(x = {-b \pm 
      <span>\sqrt</span>
      {b^2-4ac} \over 2a}\) twice. What is the probability of getting a sum 7 from
    both the throws?
    </p> <p> Test </p>"""

soup = BeautifulSoup(html, 'html.parser')
mathml_start_regex = re.compile(r'\\\(')
mathml_end_regex = re.compile(r'\\\)')

for p_tags in soup.find_all('p'):
    match = 0  # Flag set to 1 when '\(' is found and back to 0 when '\)' is found.
    # Iterate over a copy, because replace_with() changes the children in place.
    for p_child in list(p_tags.children):
        # A Tag exposes its text via get_text(); a NavigableString is already a string.
        text = p_child.get_text() if isinstance(p_child, Tag) else str(p_child)
        if mathml_start_regex.search(text):
            match = 1
        # Replace a Tag with its text; a NavigableString has no tags to strip.
        if match == 1 and isinstance(p_child, Tag):
            p_child.replace_with(text)
        if mathml_end_regex.search(text):
            match = 0
    print(p_tags)

Output:

<p>
     A 
     <span>die</span> 
      is thrown \(x = {-b \pm 
      \sqrt
      {b^2-4ac} \over 2a}\) twice. What is the probability of getting a sum 7 from
    both the throws?
    </p>
<p> Test </p>

In the code above, soup.find_all('p') returns a bs4.element.ResultSet containing every 'p' tag. The outer for loop iterates over that result set to get the individual 'p' tags, and the inner loop uses the .children generator to iterate over each tag's direct children, which are a mix of NavigableString and Tag objects. Each child is searched for '\('. When it is found, match is set to 1, and while match is 1 every Tag child is replaced by its own text using replace_with, which removes the markup. When '\)' is found, match is set back to 0.
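To make the Tag versus NavigableString distinction concrete, here is a minimal sketch that lists the types of a paragraph's direct children. The shortened HTML string and the variable name kinds are made up for illustration and are not part of the original answer.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical input: one <span> outside the formula, none inside.
sample = r"<p>A <span>die</span> is thrown \(x\) twice.</p>"
soup = BeautifulSoup(sample, "html.parser")

# .children yields only the direct children of <p>: text nodes and tags.
kinds = [type(child).__name__ for child in soup.p.children]
print(kinds)
```

Only the Tag entries carry markup, which is why the loop above only needs to replace Tag children.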

BeautifulSoup alone can't extract a substring, but you can combine it with a regex.

from bs4 import BeautifulSoup
import re

# Raw string, so that \( \pm \sqrt etc. are kept as literal backslashes.
html = r"""<p>
     A 
     <span>die</span> 
      is thrown \(x = {-b \pm 
      <span>\sqrt</span>
      {b^2-4ac} \over 2a}\) twice. What is the probability of getting a sum 7 from
    both the throws?
    </p>"""

soup = BeautifulSoup(html, 'html.parser')

print(re.findall(r'\\\(.*?\)', soup.text, re.DOTALL))

Output:

['\\(x = {-b \\pm \n      \\sqrt\n      {b^2-4ac} \\over 2a}\\)']

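Regex:

\\\(.*? \) without the space, i.e. \\\(.*?\) - lazily matches from the opening \( up to the first ), with re.DOTALL letting . match newlines. The pattern ends in an escaped parenthesis only, so a formula that itself contains a ) is cut short there. The stricter \\\(.*?\\\) requires the closing \).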
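The difference between the two patterns only shows up when the formula contains a parenthesis of its own. The following is a small sketch using a made-up sentence (the text, loose and strict names are assumptions for illustration):

```python
import re

# Hypothetical text whose formula contains its own parentheses.
text = r"Let \(f(x) = 2x\) be a function."

# Pattern from the answer: stops at the first ')' it sees.
loose = re.findall(r'\\\(.*?\)', text, re.DOTALL)

# Stricter variant: the match must end with a literal '\)'.
strict = re.findall(r'\\\(.*?\\\)', text, re.DOTALL)

print(loose)
print(strict)
```

Here loose captures only the fragment up to the ) after x, while strict captures the whole formula.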

If you want to collapse the newlines and extra whitespace, you can do it like so:

res = re.findall(r'\\\(.*?\)', soup.text, re.DOTALL)[0]
print(' '.join(res.split()))

Output:

\(x = {-b \pm \sqrt {b^2-4ac} \over 2a}\)

To wrap the string in HTML again (the wrappers shown below are added by the lxml parser, which must be installed):

print(BeautifulSoup(' '.join(res.split()), 'lxml'))

Output:

<html><body><p>\(x = {-b \pm \sqrt {b^2-4ac} \over 2a}\)</p></body></html>
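If the goal is to keep the rest of the markup and only drop the tags inside \( ... \), one possible alternative is to clean the raw HTML string with two regexes before parsing it. This is a sketch, not the method from either answer: the names MATH_SPAN, TAG and strip_tags_in_math are hypothetical, and it assumes the formulas never contain a literal < followed later by >, which the simple tag pattern would mistake for markup.

```python
import re

# Lazily matches one formula, from '\(' to the next '\)', across newlines.
MATH_SPAN = re.compile(r'\\\(.*?\\\)', re.DOTALL)
# Naive tag pattern (assumption: no bare '<' ... '>' inside the formulas).
TAG = re.compile(r'<[^>]+>')

def strip_tags_in_math(html):
    """Remove tags only inside \\( ... \\), leaving other markup untouched."""
    return MATH_SPAN.sub(lambda m: TAG.sub('', m.group(0)), html)

html = (r"<p>A <span>die</span> is thrown "
        r"\(x = {-b \pm <span>\sqrt</span> {b^2-4ac} \over 2a}\) twice.</p>")
cleaned = strip_tags_in_math(html)
print(cleaned)
```

The cleaned string can then be passed to BeautifulSoup(cleaned, 'html.parser') as before. Unlike the child-by-child loop, this also handles tags nested more than one level deep inside a formula.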
