Extract only the text, excluding the content of script tags, from HTML with BeautifulSoup

I have HTML like this:

<span class="age">
    Ages 15
    <span class="loc" id="loc_loads1">
     </span>
     <script>
        getCurrentLocationVal("loc_loads1",29.45218856,59.38139268,1);
     </script>
</span>

I am trying to extract Ages 15 using BeautifulSoup, so I wrote the following Python code.

Code:

from bs4 import BeautifulSoup as bs
import urllib3

URL = 'html file'

http = urllib3.PoolManager()

page = http.request('GET', URL)

soup = bs(page.data, 'html.parser')
age = soup.find("span", {"class": "age"})

print(age.text)

Output:

Ages 15 getCurrentLocationVal("loc_loads1",29.45218856,59.38139268,1);

I want only Ages 15, not the function call inside the script tag. Is there any way to get only the text Ages 15, or any way to exclude the content of the script tag?

PS: There are many script tags and many different URLs, so I would prefer not to strip the script text out of the output with string replacement.

Use .find(text=True), which returns the first text node inside the tag.

Example:

from bs4 import BeautifulSoup

html = """<span class="age">
    Ages 15
    <span class="loc" id="loc_loads1">
     </span>
     <script>
        getCurrentLocationVal("loc_loads1",29.45218856,59.38139268,1);
     </script>
</span>"""

soup = BeautifulSoup(html, "html.parser")
print(soup.find("span", {"class": "age"}).find(text=True).strip())

Output:

Ages 15
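Note that .find(text=True) returns only the first text node, so text that appears after a nested tag would be missed. As a minimal sketch of one way around that, the snippet below collects every text node that is a direct child of the tag. The helper name direct_text is hypothetical, not part of BeautifulSoup, and the string= keyword assumes Beautiful Soup 4.4.0 or later (older releases call it text=).

```python
from bs4 import BeautifulSoup

html = """<span class="age">
    Ages 15
    <span class="loc" id="loc_loads1">
     </span>
     <script>
        getCurrentLocationVal("loc_loads1",29.45218856,59.38139268,1);
     </script>
</span>"""


def direct_text(tag):
    """Join the text nodes that are direct children of `tag` (hypothetical helper).

    recursive=False limits the search to direct children, so text inside
    nested tags such as <script> is never visited.
    """
    pieces = tag.find_all(string=True, recursive=False)
    return " ".join(p.strip() for p in pieces if p.strip())


soup = BeautifulSoup(html, "html.parser")
age = soup.find("span", {"class": "age"})
print(direct_text(age))  # Ages 15
```

The trade-off is that text inside any nested tag is dropped, not just script content, so this fits cases like the one in the question where the wanted text sits directly inside the outer span.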

Late answer, but for future reference, you can also use decompose() to remove all script elements from the HTML, i.e.:

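from bs4 import BeautifulSoup

# html is the same string as in the example above
soup = BeautifulSoup(html, "html.parser")
# remove script and style elements from the tree (this modifies soup in place)
for script in soup(["script", "style"]):
    script.decompose()
print(soup.find("span", {"class": "age"}).text.strip())
# Ages 15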
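decompose() changes the parsed tree, which may matter if the script elements are needed later. A possible non-destructive variant, sketched below, walks the text nodes and skips those whose parent is a script or style tag. The helper name text_without and its skip parameter are assumptions made for illustration, not BeautifulSoup API.

```python
from bs4 import BeautifulSoup, NavigableString

html = """<span class="age">
    Ages 15
    <span class="loc" id="loc_loads1">
     </span>
     <script>
        getCurrentLocationVal("loc_loads1",29.45218856,59.38139268,1);
     </script>
</span>"""


def text_without(tag, skip=("script", "style")):
    """Return the text under `tag`, ignoring strings whose parent is in `skip`.

    Hypothetical helper: it reads the tree without modifying it.
    """
    parts = []
    for node in tag.descendants:
        # Only text nodes are of interest; tags themselves are skipped.
        if not isinstance(node, NavigableString):
            continue
        # Drop text that sits directly inside an excluded tag.
        if node.parent is not None and node.parent.name in skip:
            continue
        stripped = node.strip()
        if stripped:
            parts.append(stripped)
    return " ".join(parts)


soup = BeautifulSoup(html, "html.parser")
age = soup.find("span", {"class": "age"})
print(text_without(age))  # Ages 15
print(soup.find("script") is not None)  # True: the script tag is still in the tree
```

According to the Beautiful Soup documentation, version 4.9.0 and later generally no longer treat the contents of script and style tags as text when calling get_text() or .text on a parent tag, so on a recent release the output shown in the question may not reproduce at all.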
