Extract only the text, excluding the content of script tags, from HTML with BeautifulSoup
I have HTML like this:
<span class="age">
Ages 15
<span class="loc" id="loc_loads1">
</span>
<script>
getCurrentLocationVal("loc_loads1",29.45218856,59.38139268,1);
</script>
</span>
I am trying to extract "Ages 15" with BeautifulSoup, so I wrote the following Python code.
Code:
from bs4 import BeautifulSoup as bs
import urllib3
URL = 'html file'
http = urllib3.PoolManager()
page = http.request('GET', URL)
soup = bs(page.data, 'html.parser')
age = soup.find("span", {"class": "age"})
print(age.text)
Output:
Ages 15 getCurrentLocationVal("loc_loads1",29.45218856,59.38139268,1);
I only want "Ages 15", not the function call inside the script tag. Is there a way to get only the text "Ages 15", or otherwise to exclude the content of the script tag?
PS: There are many script tags, and the URLs differ. I would rather not remove the unwanted text by replacing it in the output.
Use .find(text=True), which returns the first text node found inside the tag.
Example:
from bs4 import BeautifulSoup
html = """<span class="age">
Ages 15
<span class="loc" id="loc_loads1">
</span>
<script>
getCurrentLocationVal("loc_loads1",29.45218856,59.38139268,1);
</script>
</span>"""
soup = BeautifulSoup(html, "html.parser")
print(soup.find("span", {"class": "age"}).find(text=True).strip())
Output:
Ages 15
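Note that .find(text=True) returns only the first text node, so it would miss any text that follows a nested tag. A minimal sketch of a variant that collects every text node sitting directly under the outer span is below. It assumes Beautiful Soup 4.4.0 or later, where the text argument is also available under the name string. The variable names direct_strings and own_text are illustrative only.

```python
from bs4 import BeautifulSoup

html = """<span class="age">
Ages 15
<span class="loc" id="loc_loads1">
</span>
<script>
getCurrentLocationVal("loc_loads1",29.45218856,59.38139268,1);
</script>
</span>"""

soup = BeautifulSoup(html, "html.parser")
age = soup.find("span", {"class": "age"})

# recursive=False limits the search to direct children of the tag, so text
# nested inside <script> or the inner <span> is never visited.
direct_strings = age.find_all(string=True, recursive=False)

# Drop whitespace-only nodes and join whatever is left with single spaces.
own_text = " ".join(s.strip() for s in direct_strings if s.strip())
print(own_text)
```

This leaves the parsed tree untouched, which may matter if the script tags are still needed later.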
A late answer, but for future reference: you can also use decompose() to remove all script elements from the HTML, i.e.:
soup = BeautifulSoup(html, "html.parser")
# remove script and style elements
for script in soup(["script", "style"]):
    script.decompose()
print(soup.find("span", {"class": "age"}).text.strip())
# Ages 15
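Since the question mentions many pages with many script tags, the decompose() approach could be wrapped in a small helper and called once per downloaded page. This is only a sketch: the function name text_without_scripts and its parameters are hypothetical, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup


def text_without_scripts(html, name, attrs):
    """Return the text of the first matching tag with <script>/<style> removed, or None."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find(name, attrs)
    if tag is None:
        return None
    # decompose() removes each element from the tree and destroys it.
    for unwanted in tag.find_all(["script", "style"]):
        unwanted.decompose()
    # Strip each remaining string, drop the empty ones, join with one space.
    return tag.get_text(" ", strip=True)


sample = """<span class="age">
Ages 15
<span class="loc" id="loc_loads1">
</span>
<script>
getCurrentLocationVal("loc_loads1",29.45218856,59.38139268,1);
</script>
</span>"""

print(text_without_scripts(sample, "span", {"class": "age"}))
```

Because the soup is built inside the function, removing the script elements does not affect any tree the caller holds.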