Extract text from HTML with BeautifulSoup
I am trying to parse an HTML fragment with Beautiful Soup 4 but am unable to get the data:
<div class="inside">
<a href="http://www.linkar.com">
<b>A Show</b><br/>
<img alt="A Show" height="83" src="http://www.linkar.com/679.jpg"/>
</a>
<br/>Film : Gladiator
<br/>Location : example street, London, UK
<br/>Phone : +83817447184<br/>
</div>
I am able to get the string "A Show" by using:
soup = BeautifulSoup(html, "html.parser")
a_show = soup.find('b').get_text()
How can I get the values of the Film, Location and Phone strings separately?
You can use BeautifulSoup together with the re module.
Example:
from bs4 import BeautifulSoup
import re
html = """<div class="inside">
<a href="http://www.linkar.com">
<b>A Show</b><br/>
<img alt="A Show" height="83" src="http://www.linkar.com/679.jpg"/>
</a>
<br/>Film : Gladiator
<br/>Location : example street, London, UK
<br/>Phone : +83817447184<br/>
</div>"""
soup = BeautifulSoup(html, "html.parser")
a_show = soup.find('div', class_="inside").text
# group() with no argument returns the whole match, label included
film = re.search("Film :(.*)", a_show)
if film:
    print(film.group())
location = re.search("Location :(.*)", a_show)
if location:
    print(location.group())
phone = re.search("Phone :(.*)", a_show)
if phone:
    print(phone.group())
Output:
Film : Gladiator
Location : example street, London, UK
Phone : +83817447184
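Since the question asks for the values on their own, one possible variation is to read the captured group with group(1) instead of the whole match. The sketch below is only an illustration: field_value is a hypothetical helper name, and a_show is a hand-written stand-in for the text that soup.find('div', class_="inside").text would return.

```python
import re

# Hand-written stand-in for the div text; in the answer above this string
# comes from soup.find('div', class_="inside").text.
a_show = """
A Show

Film : Gladiator
Location : example street, London, UK
Phone : +83817447184
"""


def field_value(label, text):
    """Return the text after 'label :' on its line, or None if the label is absent."""
    # '.' does not match a newline, so (.*) stops at the end of the line.
    match = re.search(re.escape(label) + r"\s*:\s*(.*)", text)
    return match.group(1).strip() if match else None


film = field_value("Film", a_show)
location = field_value("Location", a_show)
phone = field_value("Phone", a_show)

print(film)
print(location)
print(phone)
```

Here group(1) is the part inside the parentheses of the pattern, so the label and the colon are dropped, and strip() removes the leading space that the original patterns keep.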
Or, to collect all three fields in one pass with re.findall:
content = re.findall("(Film|Location|Phone) :(.*)", a_show)
if content:
    print(content)
# Python 3 output (Python 2 shows each string with a u prefix):
# --> [('Film', ' Gladiator'), ('Location', ' example street, London, UK'), ('Phone', ' +83817447184')]
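If the regular expressions feel heavy for this markup, a possible alternative is to skip re and walk the text nodes of the div directly. This is a minimal sketch, not part of the original answer: it assumes every field sits on its own line in the form "Label : value", and the fields dictionary is a name introduced here for illustration.

```python
from bs4 import BeautifulSoup

html = """<div class="inside">
<a href="http://www.linkar.com">
<b>A Show</b><br/>
<img alt="A Show" height="83" src="http://www.linkar.com/679.jpg"/>
</a>
<br/>Film : Gladiator
<br/>Location : example street, London, UK
<br/>Phone : +83817447184<br/>
</div>"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="inside")

fields = {}
# stripped_strings yields each non-empty text node with whitespace trimmed.
for line in div.stripped_strings:
    # partition splits on the first colon only; sep is empty when there is none.
    label, sep, value = line.partition(":")
    if sep:
        fields[label.strip()] = value.strip()

print(fields)
```

Lines without a colon, such as "A Show", are skipped, and each value can then be read by name, for example fields["Film"].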