
Error while web-scraping using BeautifulSoup

I am gathering housing data from Zillow's website. So far I have gathered the data from the first page. As my next step, I am trying to find the link behind the next button, which will navigate me to page 2, page 3, and so on. I used Chrome's Inspect feature to locate the next button, which has the following structure:

<a href="/homes/recently_sold/house_type/47164_rid/0_singlestory/37.720288,-121.859322,37.601788,-121.918888_rect/12_zm/2_p/" class="on" onclick="SearchMain.changePage(2);return false;" id="yui_3_18_1_1_1525048531062_27962">Next</a>

I then used Beautiful Soup's find_all method, filtering on the tag "a" and the class "on". I used the following code to extract all the links:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome(chromedriver)  # chromedriver holds the path to the ChromeDriver executable
zillow_bellevue_1 = "https://www.zillow.com/homes/Bellevue-WA-98004_rb/"
driver.get(zillow_bellevue_1)
soup = BeautifulSoup(driver.page_source, 'html.parser')

# Look for the pagination link by tag and class.
next_button = soup.find_all("a", class_="on")
print(next_button)

I am not getting any output. Any inputs on where I am going wrong?
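
For reference, one quick way to check which classes actually appear on the anchor tags of the rendered page (a minimal debugging sketch, reusing the soup object from above; anchor_classes is just an illustrative name):

# Collect every distinct class that appears on an <a> tag,
# to see what the pagination link is really called.
anchor_classes = set()
for a in soup.find_all("a"):
    for cls in (a.get("class") or []):
        anchor_classes.add(cls)
print(sorted(anchor_classes))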

The class for the next button appears to be "off", not "on". With that fixed, you can scrape the details of each property and advance through all the pages as follows. It uses the requests library to fetch the HTML, which should be faster than using a Chrome driver.

from bs4 import BeautifulSoup
import requests

base_url = "https://www.zillow.com"
url = base_url + "/homes/Bellevue-WA-98004_rb/"

# A browser-like User-Agent makes the request look less like a bot.
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'}

while url:
    req = requests.get(url, headers=headers)
    soup = BeautifulSoup(req.content, 'html.parser')
    print('\n' + url)

    # Each property is summarised in a photo-card caption div.
    for div in soup.find_all('div', class_="zsg-photo-card-caption"):
        print("  {}".format(list(div.stripped_strings)))

    # Follow the next-page link (class "off") until there isn't one.
    next_button = soup.find("a", class_="off", href=True)
    url = base_url + next_button['href'] if next_button else None

This continues requesting URLs until no next button is found. The output would be of the form:

https://www.zillow.com/homes/Bellevue-WA-98004_rb/
  ['New Construction', '$2,224,995+', '5 bds', '·', '4 ba', '·', '3,796+ sqft', 'The Castille Plan, Verano', 'D.R. Horton - Seattle']
  ['12 Central Square', '2', '$2,550+', '10290 NE 12th St, Bellevue, WA']
  ['Apartment For Rent', '$1,800/mo', '1 bd', '·', '1 ba', '·', '812 sqft', '10423 NE 32nd Pl APT E105, Bellevue, WA']
  ['House For Sale', '$1,898,000', '5 bds', '·', '4 ba', '·', '4,030 sqft', '3230 108th Ave SE, Bellevue, WA', 'Quorum Real Estate/Madison Inc']
  ['New Construction', '-- bds', '·', '-- ba', '·', '-- sqft', 'Coming Soon Plan, Northtowne', 'D.R. Horton - Seattle']
  ['The Meyden', '0', '$1,661+', '1', '$2,052+', '2', '$3,240+', '10333 Main St, Bellevue, WA']
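
If you want to reuse the crawl elsewhere, the same pagination pattern can be factored into a generator that yields one parsed page at a time. A minimal sketch under the same assumptions (it reuses base_url, headers, and the imports from the snippet above; iter_result_pages is just an illustrative name):

def iter_result_pages(start_url):
    """Yield a BeautifulSoup object per results page, following the next link."""
    url = start_url
    while url:
        req = requests.get(url, headers=headers)
        soup = BeautifulSoup(req.content, 'html.parser')
        yield soup
        # Stop once the page no longer has a next-page link.
        next_button = soup.find("a", class_="off", href=True)
        url = base_url + next_button['href'] if next_button else None

for page in iter_result_pages(base_url + "/homes/Bellevue-WA-98004_rb/"):
    for div in page.find_all('div', class_="zsg-photo-card-caption"):
        print(list(div.stripped_strings))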

I think it will be easier if you use soup.findAll.

My solution goes like this:

import re

import requests
from bs4 import BeautifulSoup

zillow_url = URL  # URL is the Zillow search-results page to scrape
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'}
response = requests.get(zillow_url, headers=headers)

soup = BeautifulSoup(response.content, 'html.parser')

# Keep just the leading dollar amount of each price, dropping any
# "/mo" suffix and stray non-digit characters.
prices = ["$" + re.sub(r'(\s\d)|(\W)|([a-z]+)', "", div.text.split("/")[0])
          for div in soup.find_all('div', class_='list-card-price')]
# print(prices)
addresses = [address.text
             for address in soup.findAll('address', class_='list-card-addr')]

# Listing links can be relative, so prepend the domain when needed.
urls = [a.get('href') if 'http' in a.get('href') else 'https://www.zillow.com' + a.get('href')
        for a in soup.find_all("a", class_="list-card-link list-card-link-top-margin list-card-img")]
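
To turn those three parallel lists into usable records, you can zip them together, on the assumption that every card yields exactly one price, one address, and one link (a card missing any field would throw the lists out of alignment):

# Combine the parallel lists into one dict per listing.
listings = [{'price': price, 'address': address, 'url': url}
            for price, address, url in zip(prices, addresses, urls)]
for listing in listings:
    print(listing)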
