Extract only the text from HTML with BeautifulSoup, excluding the contents of script tags
I have HTML like this:
<span class="age">
Ages 15
<span class="loc" id="loc_loads1">
</span>
<script>
getCurrentLocationVal("loc_loads1",29.45218856,59.38139268,1);
</script>
</span>
I am trying to extract Ages 15 using BeautifulSoup, so I wrote the following Python code:
from bs4 import BeautifulSoup as bs
import urllib3
URL = 'html file'
http = urllib3.PoolManager()
page = http.request('GET', URL)
soup = bs(page.data, 'html.parser')
age = soup.find("span", {"class": "age"})
print(age.text)
Output:
Ages 15 getCurrentLocationVal("loc_loads1",29.45218856,59.38139268,1);
I want only Ages 15, not the function call inside the script tag. Is there any way to get only the text Ages 15, or any way to exclude the content of script tags?
PS: There are many script tags and many different URLs, so I would rather not strip the script text out of the output with string replacement.
Use .find(text=True), which returns the first text node inside the tag.
Example:
from bs4 import BeautifulSoup
html = """<span class="age">
Ages 15
<span class="loc" id="loc_loads1">
</span>
<script>
getCurrentLocationVal("loc_loads1",29.45218856,59.38139268,1);
</script>
</span>"""
soup = BeautifulSoup(html, "html.parser")
print(soup.find("span", {"class": "age"}).find(text=True).strip())
Output:
Ages 15
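One caveat with .find(text=True) is that it returns only the first text node, so it would miss text that comes after a nested tag. A minimal sketch of a variation, assuming you want every text node that is a direct child of the span: the helper name direct_text is hypothetical, and the string= keyword is the newer spelling of text= (Beautiful Soup 4.4.0 and later).

```python
from bs4 import BeautifulSoup

html = """<span class="age">
Ages 15
<span class="loc" id="loc_loads1">
</span>
<script>
getCurrentLocationVal("loc_loads1",29.45218856,59.38139268,1);
</script>
</span>"""


def direct_text(tag):
    """Join the text nodes that are direct children of tag.

    Hypothetical helper for illustration, not part of bs4.
    """
    # recursive=False limits the search to direct children, so text inside
    # nested tags such as <script> is never visited.
    pieces = tag.find_all(string=True, recursive=False)
    # Drop whitespace-only nodes and trim the rest.
    return " ".join(p.strip() for p in pieces if p.strip())


soup = BeautifulSoup(html, "html.parser")
age = soup.find("span", {"class": "age"})
result = direct_text(age)
print(result)
```

Because the script body sits inside its own tag, it is not a direct child of the outer span and is skipped without modifying the tree.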
Late answer, but for future reference, you can also use decompose() to remove all script elements from the HTML, i.e.:
from bs4 import BeautifulSoup
# html is the same string as in the previous example
soup = BeautifulSoup(html, "html.parser")
# remove script and style elements
for script in soup(["script", "style"]):
    script.decompose()
print(soup.find("span", {"class": "age"}).text.strip())
# Ages 15
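Since the question mentions many script tags across different URLs, the decompose() approach could be wrapped in a small function and reused for each page. This is only a sketch: the function name text_without_scripts and its class_name parameter are made up for illustration, and it assumes you already have the page's HTML as a string (for example page.data from the question's urllib3 request).

```python
from bs4 import BeautifulSoup


def text_without_scripts(html, class_name):
    """Return the text of the first element with the given class,
    with all <script> and <style> elements removed first.

    Hypothetical helper for illustration, not part of bs4.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Remove every script and style element from the parsed tree.
    for tag in soup(["script", "style"]):
        tag.decompose()
    element = soup.find(class_=class_name)
    if element is None:
        return None  # no element with that class on this page
    # strip=True trims each text node and skips whitespace-only ones.
    return element.get_text(" ", strip=True)


sample = """<span class="age">
Ages 15
<span class="loc" id="loc_loads1">
</span>
<script>
getCurrentLocationVal("loc_loads1",29.45218856,59.38139268,1);
</script>
</span>"""

print(text_without_scripts(sample, "age"))
```

Note that decompose() modifies the parsed tree in place, so the helper parses a fresh soup on every call instead of reusing one you may still need intact.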