
How to remove parent element in BeautifulSoup?

Given this HTML structure:

<strong><a href="https://www.fertilizer.com/2021/07/bvfcl.html" target="_blank">Fertilizer Corporation Limited</a> (BVFCL)</strong> has released an employment notification for the recruitment of <strong>11 DGM, Company Secretary, Finance Manager and Accounts Officer Vacancy</strong> 

I need to remove the entire element/tag if the HTML contains fertilizer.com,

so that the final result is:

null

I learned that bs4 has a decompose() method for removing elements, but how do I apply it to the parent element, and how do I navigate to it?

Please guide me. Thanks

Given only the HTML snippet provided, this would be my solution:

from bs4 import BeautifulSoup

txt = '''
<strong>
    <a href="https://www.fertilizer.com/2021/07/bvfcl.html" target="_blank">Fertilizer Corporation Limited</a> (BVFCL)
</strong> 
has released an employment notification for the recruitment of 
<strong>11 DGM, Company Secretary, Finance Manager and Accounts Officer Vacancy
</strong> 
'''

soup = BeautifulSoup(txt, 'html.parser')
print(f'Content Before decomposition:\n{soup}')
target = "www.fertilizer.com"
hrefs = [link['href'] for link in soup.find_all('a', href=True) if target in link['href']]
print(hrefs) # ['https://www.fertilizer.com/2021/07/bvfcl.html']
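if hrefs: # Means we found it
    # clear() removes every child; calling decompose() on the soup itself
    # would leave an object whose behavior is undefined once it is used again
    soup.clear()
print(f'Content After decomposition:\n{soup}')
# The document is now empty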

Another solution, in case you just want to get nothing back, is the following. Note that the second loop removes the free text that is not enclosed in any tag:

from bs4 import BeautifulSoup


txt = '''
<strong>
    <a href="https://www.fertilizer.com/2021/07/bvfcl.html" target="_blank">Fertilizer Corporation Limited</a> (BVFCL)
</strong> 
has released an employment notification for the recruitment of 
<strong>11 DGM, Company Secretary, Finance Manager and Accounts Officer Vacancy
</strong> 
'''

soup = BeautifulSoup(txt, 'html.parser')

print(f'Content Before decomposition:\n{soup}')
target = "www.fertilizer.com"
hrefs = [link['href'] for link in soup.find_all('a', href=True) if target in link['href']]
print(hrefs) # ['https://www.fertilizer.com/2021/07/bvfcl.html']
if hrefs: # Means we found it
    # Handles tags (replace_with is the current name for the deprecated replaceWith)
    for el in soup.find_all():
        el.replace_with("")
    # Handles free text such as 'has released an employment notification for the recruitment of '
    # (because it is not inside any tag)
    for el in soup.find_all(string=True):
        el.replace_with("")
print(f'Content After decomposition:\n{soup}')
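
Neither solution above navigates to the parent, which is what the question literally asks about. A minimal sketch of that narrower case follows, assuming "parent" means the tag that directly encloses the matching link (here the first <strong>). The helper name remove_parents_of_links is hypothetical, not part of bs4. Note that this removes only that parent, so the surrounding free text and the second <strong> remain.

```python
from bs4 import BeautifulSoup

html = '''
<strong><a href="https://www.fertilizer.com/2021/07/bvfcl.html" target="_blank">Fertilizer Corporation Limited</a> (BVFCL)</strong>
has released an employment notification for the recruitment of
<strong>11 DGM, Company Secretary, Finance Manager and Accounts Officer Vacancy</strong>
'''


def remove_parents_of_links(soup, target):
    """Decompose the direct parent of every <a> whose href contains `target`.

    Hypothetical helper for illustration; returns the number of removals.
    """
    removed = 0
    while True:
        # Search again after each removal so a decomposed tag is never touched
        link = soup.find('a', href=lambda h: h is not None and target in h)
        if link is None:
            break
        parent = link.parent  # the enclosing tag, or the soup for top-level links
        if parent is soup:
            link.decompose()    # no enclosing tag: remove the link itself
        else:
            parent.decompose()  # remove the parent together with the link
        removed += 1
    return removed


soup = BeautifulSoup(html, 'html.parser')
removed = remove_parents_of_links(soup, 'fertilizer.com')
print(removed)
print(soup)
```

If a specific ancestor is wanted rather than the direct parent, link.find_parent('strong') could be used in place of link.parent; which one fits depends on the real page structure, which the question does not show.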

