
Error while web-scraping using BeautifulSoup

I am gathering housing data from Zillow's website. So far I have gathered data from the first webpage. For my next step, I am trying to find the link behind the Next button, which will navigate me to page 2, page 3, and so on. I used Chrome's Inspect feature to locate the Next button, which has the following structure:

<a href="/homes/recently_sold/house_type/47164_rid/0_singlestory/37.720288,-121.859322,37.601788,-121.918888_rect/12_zm/2_p/" class="on" onclick="SearchMain.changePage(2);return false;" id="yui_3_18_1_1_1525048531062_27962">Next</a>

I then used Beautiful Soup's find_all method, filtering on tag "a" and class "on". I used the following code to extract all the links:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome(chromedriver)  # chromedriver: path to the ChromeDriver executable
zillow_bellevue_1 = "https://www.zillow.com/homes/Bellevue-WA-98004_rb/"
driver.get(zillow_bellevue_1)
soup = BeautifulSoup(driver.page_source, 'html.parser')

next_button = soup.find_all("a", class_="on")
print(next_button)

I am not getting any output. Any input on where I am going wrong?
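A quick diagnostic (not part of the original question) is to print the class of every pagination anchor in the parsed page, which shows what class the Next link actually carries; a minimal sketch reusing the soup object built above:

# Hypothetical debugging snippet: list each anchor that has an onclick
# handler and show its class attribute and target href.
for a in soup.find_all("a", onclick=True):
    print(a.get("class"), a.get("href"))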

The class for the next button appears to be off, not on, so you can scrape the details of each property and advance through all the pages as follows. It uses the requests library to get the HTML, which should be faster than using a Chrome driver.

from bs4 import BeautifulSoup
import requests

base_url = "https://www.zillow.com"
url = base_url + "/homes/Bellevue-WA-98004_rb/"

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'}    

while url:
    req = requests.get(url, headers=headers)
    soup = BeautifulSoup(req.content, 'html.parser')
    print('\n' + url)

    # Print the caption text of each property card on the current page
    for div in soup.find_all('div', class_="zsg-photo-card-caption"):
        print("  {}".format(list(div.stripped_strings)))

    # Follow the next-page link if present; otherwise stop the loop
    next_button = soup.find("a", class_="off", href=True)
    url = base_url + next_button['href'] if next_button else None

This continues requesting URLs until no next button is found. The output would be of the form:

https://www.zillow.com/homes/Bellevue-WA-98004_rb/
  ['New Construction', '$2,224,995+', '5 bds', '·', '4 ba', '·', '3,796+ sqft', 'The Castille Plan, Verano', 'D.R. Horton - Seattle']
  ['12 Central Square', '2', '$2,550+', '10290 NE 12th St, Bellevue, WA']
  ['Apartment For Rent', '$1,800/mo', '1 bd', '·', '1 ba', '·', '812 sqft', '10423 NE 32nd Pl APT E105, Bellevue, WA']
  ['House For Sale', '$1,898,000', '5 bds', '·', '4 ba', '·', '4,030 sqft', '3230 108th Ave SE, Bellevue, WA', 'Quorum Real Estate/Madison Inc']
  ['New Construction', '-- bds', '·', '-- ba', '·', '-- sqft', 'Coming Soon Plan, Northtowne', 'D.R. Horton - Seattle']
  ['The Meyden', '0', '$1,661+', '1', '$2,052+', '2', '$3,240+', '10333 Main St, Bellevue, WA']
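If you want the listings back as Python data rather than printed output, a minimal variation of the same loop (identical selectors; it assumes base_url, headers, and the starting url from the answer above) simply accumulates the rows:

listings = []  # one entry per property card, across all pages
while url:
    req = requests.get(url, headers=headers)
    soup = BeautifulSoup(req.content, 'html.parser')
    for div in soup.find_all('div', class_="zsg-photo-card-caption"):
        listings.append(list(div.stripped_strings))
    next_button = soup.find("a", class_="off", href=True)
    url = base_url + next_button['href'] if next_button else None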

I think it will be easier if you use soup.findAll (the older alias of find_all).

My solution goes this way:

import re

import requests
from bs4 import BeautifulSoup

zillow_url = URL  # URL: the Zillow search results page to scrape
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'}
response = requests.get(zillow_url, headers=headers)

soup = BeautifulSoup(response.content, 'html.parser')

# Keep the part of each price before any '/' (e.g. '/mo'), then strip
# separators, signs, and lowercase suffixes, leaving digits prefixed with '$'
prices = ["$" + re.sub(r'(\s\d)|(\W)|([a-z]+)', "", div.text.split("/")[0])
          for div in soup.find_all('div', class_='list-card-price')]
# print(prices)
addresses = [div.text for div in
             soup.findAll('address', class_='list-card-addr')]

# Make relative links absolute by prefixing the Zillow domain
urls = [x.get('href') if 'http' in x.get('href')
        else 'https://www.zillow.com' + x.get('href')
        for x in soup.find_all("a", class_="list-card-link list-card-link-top-margin list-card-img")]
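Since prices, addresses, and urls are parallel lists, one way to combine them into rows and save them is sketched below; the filename listings.csv is illustrative, and zip assumes the three lists line up one-to-one, which Zillow's markup does not guarantee:

import csv

# Write one row per listing; zip stops at the shortest of the three lists.
with open("listings.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["price", "address", "url"])
    writer.writerows(zip(prices, addresses, urls))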
