I have managed to scrape a website using the findAll function in Beautiful Soup with h2, class and div tags (e.g. soup.findAll('div', {'class': 'price'})). But one part of the website uses p tags, which I'm not sure how to scrape. It contains the following:
Listing history
<p class="top">
<strong>First listed</strong><br>
800 on
I want the 800, but the div with class "Sidebar sbt" appears several times on the website, as does the p with class "top". Any help would be appreciated.
Thanks
You can find the p tags just as you would any other tag using BeautifulSoup:
>>> from bs4 import BeautifulSoup as BS
>>> with open('html', 'r') as f:
...     soup = BS(f, "lxml")
...
>>> soup.find_all('p', attrs={'class':'top'})
[<p class="top">
<strong>First listed</strong><br/>
800 on
</p>]
soup.find_all always returns a ResultSet (a list-like collection of the matching tags), even when only one tag matches. From there you can loop over it, for example:
>>> p_tags = soup.find_all('p', attrs={'class':'top'})
>>> for tag in p_tags:
...     tag.get_text()
...
'\nFirst listed\n 800 on\n'
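Since get_text() returns the label and the trailing "on" along with the number, one way to isolate the 800 is a regular expression. This is only a minimal sketch: it assumes the price is the first run of digits in the tag's text, first_number is a hypothetical helper name, and the wrapping div is reconstructed from the question rather than taken from the real page.

```python
import re
from bs4 import BeautifulSoup

# Sample markup reconstructed from the question (assumed structure).
html = """
<div class="sidebar sbt">
<p class="top">
<strong>First listed</strong><br>
800 on
</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def first_number(tag):
    """Return the first run of digits in the tag's text as an int, or None."""
    match = re.search(r"\d+", tag.get_text())
    return int(match.group()) if match else None


# One entry per <p class="top"> found in the document.
prices = [first_number(p) for p in soup.find_all("p", class_="top")]
print(prices)  # [800]
```

If the real prices contain thousands separators or currency symbols, the pattern would need adjusting.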
If the real page is structured like the example, try something like this:
>>> from bs4 import BeautifulSoup
>>> html = """<div class="price">
<p class="top">
<strong>First listed</strong><br>
800 on
</p>
<p class="top">
<strong>First listed</strong><br>
900 on
</p>
<p class="top">
<strong>First listed</strong><br>
1000 on
</p>
</div>"""
>>> soup = BeautifulSoup(html, "html.parser")
>>> div = soup.find('div', class_='price')
>>> # find_all collects every <p class="top"> inside the div
>>> for p_tag in div.find_all('p', class_='top'):
...     # split() breaks the text on whitespace into
...     # ['First', 'listed', '800', 'on']; [-2] picks the number
...     print(p_tag.text.split()[-2])
...
800
900
1000
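Because the question mentions that p class="top" occurs several times, it may help to pick the right paragraph by its label instead of its position. The following is a rough sketch, not a definitive solution: value_for_label is a hypothetical helper, and the "Last sold" paragraph is invented sample data to show the filtering.

```python
from bs4 import BeautifulSoup

# Assumed markup: the "Last sold" entry is made up for illustration.
html = """
<div class="sidebar sbt">
<p class="top">
<strong>Last sold</strong><br>
750 on
</p>
<p class="top">
<strong>First listed</strong><br>
800 on
</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def value_for_label(soup, label):
    """Return the first word after the <strong> label, or None if not found."""
    for p in soup.find_all("p", class_="top"):
        strong = p.find("strong")
        if strong and strong.get_text(strip=True) == label:
            # Join the text pieces with single spaces: "First listed 800 on"
            text = p.get_text(" ", strip=True)
            rest = text[len(label):].split()
            return rest[0] if rest else None
    return None


print(value_for_label(soup, "First listed"))  # 800
```

Matching on the label keeps the code working if the order of the paragraphs changes, at the cost of depending on the exact label text.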