
How to scrape p tags using Beautiful Soup

I have managed to scrape a website using the findAll function in Beautiful Soup with h2, div, and class lookups (e.g. soup.findAll('div', {'class' : 'price'})). But one part of the website uses p tags, which I'm not sure how to scrape. It contains the following:

Listing history

<p class="top">
    <strong>First listed</strong><br>
            800 on

I want the 800, but the div with class "Sidebar sbt" appears several times on the website, as does p class="top". Any help would be appreciated.

Thanks

You can find the p tags just as you would any other tag using BeautifulSoup:

>>> from bs4 import BeautifulSoup as BS
>>> with open('html', 'r') as f:
...     soup = BS(f, "lxml")
... 
>>> soup.find_all('p', attrs={'class':'top'})
[<p class="top">
<strong>First listed</strong><br/>
            800 on
</p>]

soup.find_all always returns a ResultSet (a list of matching tags), even when only one tag matches. So from there you would loop over it, something like:

>>> p_tags = soup.find_all('p', attrs={'class':'top'})
>>> for tag in p_tags:
...     tag.get_text()
... 
'\nFirst listed\n            800 on\n'
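
To go from that text to just the 800, one option is to look at the text fragments inside each tag. The following is only a sketch: the markup is modelled on the snippet in the question (the closing p tag is assumed), and first_listed_values is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question's snippet (closing </p> assumed).
html = """
<p class="top">
    <strong>First listed</strong><br>
            800 on
</p>
"""

soup = BeautifulSoup(html, "html.parser")


def first_listed_values(soup):
    """Return the first token that follows each 'First listed' label."""
    values = []
    for tag in soup.find_all("p", class_="top"):
        # stripped_strings yields each text fragment with surrounding
        # whitespace removed, here "First listed" and "800 on".
        parts = list(tag.stripped_strings)
        if len(parts) >= 2 and parts[0] == "First listed":
            values.append(parts[1].split()[0])
    return values


print(first_listed_values(soup))
```

Checking the label first means other p tags with class "top" that hold different information are skipped rather than misread.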

If the real page looks just like the example, try something like this:

>>> from bs4 import BeautifulSoup
>>> html = """<div class="price">

 <p class="top">
     <strong>First listed</strong><br>
             800 on
 </p>
 <p class="top">
     <strong>First listed</strong><br>
             900 on
 </p>
 <p class="top">
     <strong>First listed</strong><br>
             1000 on
 </p>

 </div>"""
>>> soup = BeautifulSoup(html, "html.parser")
>>> divs = soup.find_all('div', class_='price')
>>> for div in divs:
...     # search for every p tag with class "top" inside the div
...     for p_tag in div.find_all('p', class_='top'):
...         # split the text on whitespace; the number is the second-to-last item
...         print(p_tag.text.split()[-2])
...
800
900
1000
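
Since the question says the sidebar div and the p tags with class "top" both occur several times, it may help to narrow the search by container and by label. The sketch below makes several assumptions: the class names are written lowercase as "sidebar" and "sbt", the second block with a "Last sold" label is invented for illustration, and value_after_label is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the class names and the second block are assumptions.
html = """
<div class="sidebar sbt">
  <p class="top"><strong>First listed</strong><br> 800 on</p>
</div>
<div class="sidebar sbt">
  <p class="top"><strong>Last sold</strong><br> 750 on</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def value_after_label(soup, label):
    """Return the first token after the given <strong> label, or None."""
    # The CSS selector limits the search to p.top tags inside the sidebar divs.
    for strong in soup.select("div.sidebar.sbt p.top > strong"):
        if strong.get_text(strip=True) == label:
            # The parent <p> holds the label followed by the trailing text.
            text = strong.parent.get_text(" ", strip=True)
            remainder = text[len(label):].split()
            return remainder[0] if remainder else None
    return None


print(value_after_label(soup, "First listed"))
```

If the real class names differ in case or spelling, the selector string would need adjusting to match the page.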
