如何简单地从 BeautifulSoup 中找到的元素中删除所有标签?
With BeautifulStoneSoup
gone in bs4
, it's even simpler in Python3
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
text = soup.get_text()
print(text)
why has no answer I've seen mentioned anything about the unwrap
method? Or, even easier, the get_text
method
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#unwrap http://www.crummy.com/software/BeautifulSoup/bs4/doc/#get-text
Use get_text() , it returns all the text in a document or beneath a tag, as a single Unicode string.
For instance, remove all different script tags from the following text:
<td><a href="http://www.irit.fr/SC">Signal et Communication</a>
<br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a>
</td>
The expected result is:
Signal et Communication
Ingénierie Réseaux et Télécommunications
Here is the source code:
#!/usr/bin/env python3
from bs4 import BeautifulSoup
text = '''
<td><a href="http://www.irit.fr/SC">Signal et Communication</a>
<br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a>
</td>
'''
soup = BeautifulSoup(text)
print(soup.get_text())
You can use the decompose method in bs4:
soup = bs4.BeautifulSoup('<body><a href="http://example.com/">I linked to <i>example.com</i></a></body>')
for a in soup.find('a').children:
if isinstance(a,bs4.element.Tag):
a.decompose()
print soup
Out: <html><body><a href="http://example.com/">I linked to </a></body></html>
Code to simply get the contents as text instead of html:
'html_text' parameter is the string which you will pass in this function to get the text
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_text, 'lxml')
text = soup.get_text()
print(text)
it looks like this is the way to do! as simple as that
with this line you are joining together the all text parts within the current element
''.join(htmlelement.find(text=True))
Here is the source code: you can get the text which is exactly in the URL
URL = ''
page = requests.get(URL)
soup = bs4.BeautifulSoup(page.content,'html.parser').get_text()
print(soup)
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.