
Difficulty using BeautifulSoup in Python to scrape web data from multiple HTML classes

I am using Beautiful Soup in Python to scrape some data from a property listings site.

I can scrape each element I need individually, but I would like a more efficient script that pulls back all the data in one command, if possible. The difficulty is that the elements I need sit in different classes.

I have tried the following, so far.

for listing in content.findAll('h2', attrs={"class": "listing-results-attr"}):
    print(listing.text)

which successfully gives the following list

15 room mansion for sale
3 bed barn conversion for sale
2 room duplex for sale
1 bed garden shed for sale

Separately, to retrieve the address details for each listing, I have used the following successfully:

for address in content.findAll('a', attrs={"class": "listing-results-address"}):
    print(address.text)

which gives this

22 Acacia Avenue, CityName Postcode
100 Sleepy Hollow, CityName Postcode
742 Evergreen Terrace, CityName Postcode
31 Spooner Street, CityName Postcode

And for property price I have used this...

for prop_price in content.findAll('a', attrs={"class": "listing-results-price"}):
    print(prop_price.text)

which gives...

$350,000
$1,250,000
$750,000
$100,000

This is great; however, I need to pull back all of this information more efficiently, so that all the data comes back in one pass.

At present I can do this using something like the code below:

elements = content.select("h2.listing-results-attr, a.listing-results-address, a.listing-results-price")

This works to a degree, but it brings back too many additional HTML tags and is not nearly as elegant as I need. The results look like this:

</a>, <h2 class="listing-results-attr">
<a href="redacted" style="text-decoration:underline;">15 room mansion for sale</a>
</h2>, <a class="listing-results-address" href="redacted">22 Acacia Avenue, CityName Postcode</a>, <a class="listing-results-price" href="redacted">

$350,000

Expected results should look something like this:

15 room mansion for sale
22 Acacia Avenue, CityName Postcode
$350,000

3 bed barn conversion for sale
100 Sleepy Hollow, CityName Postcode
$1,250,000

etc 
etc

I then need to be able to store the results as JSON objects for later analysis.

Thanks in advance.

Change your selectors as shown below:

import requests
from bs4 import BeautifulSoup as bs

url = 'https://www.zoopla.co.uk/for-sale/property/caerphilly/?q=Caerphilly&results_sort=newest_listings&search_source=home'
r = requests.get(url)
soup = bs(r.content, 'lxml')
details = [item.text.strip() for item in soup.select(".listing-results-attr a, .listing-results-address, .text-price")]

You can then view each field separately with, for example:

prices = details[0::3]
descriptions = details[1::3]
addresses = details[2::3]
print(prices, descriptions, addresses)
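The slices above rely on every listing contributing exactly three items, always in the same order. As a minimal sketch, assuming that order (price, description, address) and using a hypothetical sample list in place of the scraped details, the flat list could be regrouped into one record per listing, with a length check that fails loudly if a listing is missing a field. The name group_listings is an illustrative helper, not part of Beautiful Soup.

```python
# Hypothetical sample standing in for the scraped `details` list; the
# price/description/address order is the one the slices above assume.
details = [
    "£159,950", "2 bed detached bungalow for sale", "Bronrhiw Fach, Caerphilly CF83",
    "£102,950", "3 bed semi-detached house for sale", "Pen-Y-Bryn, Caerphilly CF83",
]


def group_listings(flat, fields=("price", "description", "address")):
    """Split a flat, interleaved list into one dict per listing."""
    step = len(fields)
    if len(flat) % step != 0:
        # A listing lacks a field, so fixed strides would drift out of sync.
        raise ValueError(f"expected a multiple of {step} items, got {len(flat)}")
    return [dict(zip(fields, flat[i:i + step])) for i in range(0, len(flat), step)]


listings = group_listings(details)
for listing in listings:
    print(listing["description"], listing["address"], listing["price"], sep="\n")
    print()
```

The length check only catches a missing item when the total count is off; if listings can lack a field, parsing each listing container on its own is the safer route.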

find_all() always returns a list (a ResultSet, which is a list subclass), and strip() removes whitespace from the beginning and end of a string.

import requests
from bs4 import BeautifulSoup as bs

url = 'https://www.zoopla.co.uk/for-sale/property/caerphilly/?q=Caerphilly&results_sort=newest_listings&search_source=home'
r = requests.get(url)
soup = bs(r.content, 'lxml')

# Container that holds every listing on the results page
results = soup.find("ul", {"class": "listing-results clearfix js-gtm-list"})

# Each <li> is one listing, so its three fields are read together
for li in results.find_all("li", {"class": "srp clearfix"}):
    price = li.find("a", {"class": "listing-results-price text-price"}).text.strip()
    address = li.find("a", {"class": "listing-results-address"}).text.strip()
    description = li.find("h2", {"class": "listing-results-attr"}).find("a").text.strip()

    print(description)
    print(address)
    print(price)

Output:

2 bed detached bungalow for sale
Bronrhiw Fach, Caerphilly CF83
£159,950
2 bed semi-detached house for sale
Cwrt Nant Y Felin, Caerphilly CF83
£159,950
3 bed semi-detached house for sale
Pen-Y-Bryn, Caerphilly CF83
£102,950
.....
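Since the question also asks for the results as JSON, here is a hedged sketch of how a per-listing loop like the one above might feed json.dumps. The HTML snippet is a hypothetical, simplified stand-in for the real page, text_or_none is an illustrative helper name, and html.parser is used here only to avoid the extra lxml dependency.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical, simplified markup modelled on the class names in this thread.
html = """
<ul class="listing-results">
  <li class="srp">
    <h2 class="listing-results-attr"><a href="#">2 bed detached bungalow for sale</a></h2>
    <a class="listing-results-address" href="#">Bronrhiw Fach, Caerphilly CF83</a>
    <a class="listing-results-price text-price" href="#"> £159,950 </a>
  </li>
  <li class="srp">
    <h2 class="listing-results-attr"><a href="#">3 bed semi-detached house for sale</a></h2>
    <a class="listing-results-address" href="#">Pen-Y-Bryn, Caerphilly CF83</a>
  </li>
</ul>
"""


def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None if absent."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node else None


soup = BeautifulSoup(html, "html.parser")

# One dict per listing container, so a missing field cannot shift the others.
records = [
    {
        "description": text_or_none(li, "h2.listing-results-attr a"),
        "address": text_or_none(li, "a.listing-results-address"),
        "price": text_or_none(li, "a.listing-results-price"),
    }
    for li in soup.select("li.srp")
]

# ensure_ascii=False keeps the pound sign readable in the output.
as_json = json.dumps(records, ensure_ascii=False, indent=2)
print(as_json)
```

A listing without a price comes out with "price": null instead of raising an AttributeError, which keeps the records usable for later analysis.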

