
What's the best way to scrape data from Zillow?

I have been unsuccessful in trying to gather data from Zillow.

Example:

url = https://www.zillow.com/homes/for_sale/Los-Angeles-CA_rb/?fromHomePage=true&shouldFireSellPageImplicitClaimGA=false&fromHomePageTab=buy

I want to pull information like addresses, prices, Zestimates, and locations for all homes in LA.

I have tried HTML scraping using packages like BeautifulSoup. I have also tried using the JSON. I'm almost positive that Zillow's API will not be helpful; my understanding is that the API is best for gathering information on a specific property.

I have been able to scrape information from other sites, but it seems that Zillow uses dynamic IDs (which change on every refresh), making it more difficult to access that information.

UPDATE: I tried the code below, but it still doesn't produce any results:

import requests
from bs4 import BeautifulSoup

url = 'https://www.zillow.com/homes/for_sale/Los-Angeles-CA_rb/?fromHomePage=true&shouldFireSellPageImplicitClaimGA=false&fromHomePageTab=buy'

page = requests.get(url)
data = page.content

soup = BeautifulSoup(data, 'html.parser')

for li in soup.find_all('div', {'class': 'zsg-photo-card-caption'}):
    try:
        # There are sponsored links in the list; you might need to take
        # care of those. Better to check for null values, which we are
        # not doing here.
        print(li.find('span', {'class': 'zsg-photo-card-price'}).text)
        print(li.find('span', {'class': 'zsg-photo-card-info'}).text)
        print(li.find('span', {'class': 'zsg-photo-card-address'}).text)
        print(li.find('span', {'class': 'zsg-photo-card-broker-name'}).text)
    except AttributeError:
        print('An error occurred')

It's probably because you're not passing headers.

If you take a look at Chrome's Network tab in Developer Tools, these are the headers passed by the browser:

:authority:www.zillow.com
:method:GET
:path:/homes/for_sale/Los-Angeles-CA_rb/?fromHomePage=true&shouldFireSellPageImplicitClaimGA=false&fromHomePageTab=buy
:scheme:https
accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
accept-encoding:gzip, deflate, br
accept-language:en-US,en;q=0.8
upgrade-insecure-requests:1
user-agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36

However, if you try sending all of them, it will fail, because requests doesn't let you send headers beginning with a colon (':').

I tried skipping those four and used the other five in this script. It worked. So try this:

from bs4 import BeautifulSoup
import requests

req_headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.8',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}

with requests.Session() as s:
    url = 'https://www.zillow.com/homes/for_sale/Los-Angeles-CA_rb/?fromHomePage=true&shouldFireSellPageImplicitClaimGA=false&fromHomePageTab=buy'
    r = s.get(url, headers=req_headers)

After that, you can use BeautifulSoup to extract the information you need:

soup = BeautifulSoup(r.content, 'lxml')
price = soup.find('span', {'class': 'zsg-photo-card-price'}).text
info = soup.find('span', {'class': 'zsg-photo-card-info'}).text
address = soup.find('span', {'itemprop': 'address'}).text

Here is a sample of the data extracted from that page:

+--------------+-----------------------------------------------------------+
| $615,000     |  121 S Hope St APT 435 Los Angeles CA 90012               |
| $330,000     |  4859 Coldwater Canyon Ave APT 14A Sherman Oaks CA 91423  |
| $3,495,000   |  13446 Valley Vista Blvd Sherman Oaks CA 91423            |
| $1,199,000   |  6241 Crescent Park W UNIT 410 Los Angeles CA 90094       |
| $771,472+    |  Chase St. And Woodley Ave # HGS0YX North Hills CA 91343  |
| $369,000     |  8650 Gulana Ave UNIT L2179 Playa Del Rey CA 90293        |
| $595,000     |  6427 Klump Ave North Hollywood CA 91606                  |
+--------------+-----------------------------------------------------------+
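As noted in the question's code comments, some of these spans can be missing (e.g. on sponsored cards), so it's safer to check for None before reading .text. A minimal sketch of a null-safe extraction loop, run on a trimmed, hypothetical stand-in for one result card (the class names match the ones used above, but Zillow changes its markup over time):

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical stand-in for one search-result card.
html = '''
<div class="zsg-photo-card-caption">
  <span class="zsg-photo-card-price">$615,000</span>
  <span class="zsg-photo-card-info">2 bds, 2 ba</span>
  <span class="zsg-photo-card-address">121 S Hope St APT 435</span>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
for card in soup.find_all('div', {'class': 'zsg-photo-card-caption'}):
    for cls in ('zsg-photo-card-price', 'zsg-photo-card-info',
                'zsg-photo-card-address', 'zsg-photo-card-broker-name'):
        span = card.find('span', {'class': cls})
        # Guard against cards that lack some of these spans.
        print(cls, '->', span.text if span else '(missing)')
```

This prints '(missing)' for the absent broker-name span instead of raising an AttributeError.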

Scraping Zillow is actually not too difficult. The first thing to note is that it's a Next.js website, meaning we can parse JavaScript objects instead of HTML to scrape structured data.

I cover all of this in extensive detail in my blog post How to Scrape Zillow.com, but let's summarize the most important parts:

Scraping Properties

First, let's take a look at the property page itself and how to scrape its data. If we look at the property page's source, we can see that a __NEXT_DATA__ variable is present:

(Screenshot: the property page source, showing the __NEXT_DATA__ script tag.)

So we can extract this data with a simple CSS selector: script#__NEXT_DATA__
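With an HTML parser you would use that selector directly; as a dependency-free sketch, a regex works too. The HTML string here is a trimmed, hypothetical stand-in for a saved property page source (a real page embeds the full page state in the same script tag, with far more fields):

```python
import json
import re

# Trimmed, hypothetical stand-in for a saved property page source.
html = '''<html><body>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"zpid": 12345, "price": 615000}}}
</script>
</body></html>'''

# Equivalent to selecting script#__NEXT_DATA__ with a parser:
# the tag body is plain JSON, so json.loads gives us structured data.
m = re.search(r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.S)
next_data = json.loads(m.group(1))
print(next_data['props']['pageProps'])
```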

Scraping Search

Now, to find the properties themselves, we can use a very similar technique:

First, we need to build our search URL, which looks something like https://www.zillow.com/homes/<QUERY>_rb/, where <QUERY> is a location such as a zipcode or city name, e.g. https://www.zillow.com/homes/New-Haven,-CT_rb/. Then, if we scrape the page, we can find the backend API parameters in the page body the same way we found __NEXT_DATA__ previously, this time by using a regex:

re.findall(r'"queryState":(\{.+}),\s*"filter', html_response.text)

Full scraper code is a bit out of scope for a Stack Overflow question, but by combining these two techniques we can scrape Zillow with very little actual code!
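To show that regex in action, here is a self-contained sketch. The page body is a small, hypothetical stand-in with the same shape as the embedded search state on a real page:

```python
import json
import re

# Hypothetical, trimmed stand-in for a search page body; the real page
# embeds "queryState" the same way, just with many more fields.
html_text = ('... "queryState":{"pagination":{},'
             '"usersSearchTerm":"New Haven, CT"}, "filter": '
             '{"isAllHomes":{"value":true}} ...')

# Capture the JSON object between "queryState": and , "filter".
matches = re.findall(r'"queryState":(\{.+}),\s*"filter', html_text)
query_state = json.loads(matches[0])
print(query_state['usersSearchTerm'])
```

The captured group is valid JSON, so json.loads turns it straight into a dict of search parameters.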

You can try some paid tools like https://www.scraping-bot.io/how-to-scrape-real-estate-listings-on-zillow/:

  1. Find what you need via the sitemap: https://www.zillow.com/sitemap/catalog/sitemap.xml

  2. Scrape data from the URLs in the sitemap.
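Step 1 can be sketched with the standard library. The XML string here is a small, hypothetical stand-in with the same shape as a real sitemap index file (the child URLs are made up for illustration):

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the sitemap index at
# https://www.zillow.com/sitemap/catalog/sitemap.xml
sitemap_xml = '''<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://www.zillow.com/sitemap/catalog/sitemap-1.xml.gz</loc></sitemap>
  <sitemap><loc>https://www.zillow.com/sitemap/catalog/sitemap-2.xml.gz</loc></sitemap>
</sitemapindex>'''

# Sitemaps use the sitemaps.org namespace, so findall needs a prefix map.
ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
root = ET.fromstring(sitemap_xml)
urls = [loc.text for loc in root.findall('.//sm:loc', ns)]
print(urls)
```

Each extracted URL is a child sitemap you would then fetch and parse the same way to get the listing URLs for step 2.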
