Given this HTML structure:
<strong><a href="https://www.fertilizer.com/2021/07/bvfcl.html" target="_blank">Fertilizer Corporation Limited</a> (BVFCL)</strong> has released an employment notification for the recruitment of <strong>11 DGM, Company Secretary, Finance Manager and Accounts Officer Vacancy</strong>
I need to remove the entire element/tag if the HTML structure has fertilizer.com in it.
The final result should be:
null
I learned that bs4 has a decompose() method for removing elements, but how do I apply it to the parent element, and how do I navigate to it?
Please guide me. Thanks
Given only the piece of HTML provided, this would be my solution:
from bs4 import BeautifulSoup
txt = '''
<strong>
<a href="https://www.fertilizer.com/2021/07/bvfcl.html" target="_blank">Fertilizer Corporation Limited</a> (BVFCL)
</strong>
has released an employment notification for the recruitment of
<strong>11 DGM, Company Secretary, Finance Manager and Accounts Officer Vacancy
</strong>
'''
soup = BeautifulSoup(txt, 'html.parser')
print(f'Content Before decomposition:\n{soup}')
target = "www.fertilizer.com"
hrefs = [link['href'] for link in soup.find_all('a', href=True) if target in link['href']]
print(hrefs) # ['https://www.fertilizer.com/2021/07/bvfcl.html']
if hrefs:  # Means we found it
    soup.decompose()
print(f'Content After decomposition:\n{soup}')
# <None></None>
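On the part of the question about navigating to the parent: the snippet above decomposes the whole document, not the enclosing element. A minimal sketch of the narrower case, assuming the element to remove is the nearest `<strong>` ancestor of the matching link, could use `find_parent()`. The helper name `remove_parents_of_links` and its `parent_name` parameter are made up for illustration and are not part of bs4.

```python
from bs4 import BeautifulSoup

html = '''
<strong><a href="https://www.fertilizer.com/2021/07/bvfcl.html" target="_blank">Fertilizer Corporation Limited</a> (BVFCL)</strong>
has released an employment notification for the recruitment of
<strong>11 DGM, Company Secretary, Finance Manager and Accounts Officer Vacancy</strong>
'''


def remove_parents_of_links(soup, target, parent_name="strong"):
    """Decompose the nearest <parent_name> ancestor of each <a> whose href contains target.

    Hypothetical helper: assumes the parents to remove are not nested inside each other.
    """
    parents = []
    # Collect first, then decompose, so we never read attributes of a tag
    # that was already destroyed along with its parent.
    for link in soup.find_all("a", href=True):
        if target in link["href"]:
            parent = link.find_parent(parent_name)
            if parent is not None and not any(parent is p for p in parents):
                parents.append(parent)
    for parent in parents:
        parent.decompose()
    return len(parents)


soup = BeautifulSoup(html, "html.parser")
removed = remove_parents_of_links(soup, "www.fertilizer.com")
print(removed)
print(soup)
```

Here only the first `<strong>` (the one wrapping the link) is removed; the free text and the second `<strong>` stay in the tree. If the real page wraps everything in a single container such as a `<p>` or `<div>`, passing that tag name as `parent_name` would remove the whole block.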
Another solution, in case you just want to get nothing back, is the following. Note that the second loop removes the free text that is not enclosed in a particular tag:
from bs4 import BeautifulSoup
txt = '''
<strong>
<a href="https://www.fertilizer.com/2021/07/bvfcl.html" target="_blank">Fertilizer Corporation Limited</a> (BVFCL)
</strong>
has released an employment notification for the recruitment of
<strong>11 DGM, Company Secretary, Finance Manager and Accounts Officer Vacancy
</strong>
'''
soup = BeautifulSoup(txt, 'html.parser')
print(f'Content Before decomposition:\n{soup}')
target = "www.fertilizer.com"
hrefs = [link['href'] for link in soup.find_all('a', href=True) if target in link['href']]
print(hrefs) # ['https://www.fertilizer.com/2021/07/bvfcl.html']
if hrefs:  # Means we found it
    # Handles tags
    for el in soup.find_all():
        el.replace_with("")
    # Handles free text like: 'has released an employment notification for the recruitment of ' (because it is not inside a particular tag)
    for el in soup.find_all(string=True):
        el.replace_with("")
print(f'Content After decomposition:\n{soup}')
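If the aim is simply an empty result, the two loops can probably be collapsed into a single call. This is a sketch rather than a drop-in replacement: it assumes that emptying the whole parsed document is acceptable, and it uses `Tag.clear()`, which the `BeautifulSoup` object inherits. Mapping the empty string to `None` at the end is my own choice to mirror the "null" asked for in the question.

```python
from bs4 import BeautifulSoup

html = '''
<strong><a href="https://www.fertilizer.com/2021/07/bvfcl.html" target="_blank">Fertilizer Corporation Limited</a> (BVFCL)</strong>
has released an employment notification for the recruitment of
<strong>11 DGM, Company Secretary, Finance Manager and Accounts Officer Vacancy</strong>
'''

soup = BeautifulSoup(html, "html.parser")
target = "www.fertilizer.com"

# True if any link in the document points at the target domain
has_target = any(target in a["href"] for a in soup.find_all("a", href=True))

if has_target:
    # Removes every child (tags and free text) from the document in one call
    soup.clear()

# Assumption: represent "nothing left" as None instead of an empty string
result = str(soup) or None
print(result)
```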