Web scraping: extracting a URL from an XPath in HTML using Python: Airbnb listings

I am trying to extract the URLs of listings from a city page on Airbnb using Python 3 libraries. I am familiar with scraping simpler websites with the BeautifulSoup and requests libraries.

url: 'https://www.airbnb.com/s/Denver--CO--United-States/homes'

Element in the HTML:

If I inspect a link element on the page (in Chrome), I get:

xpath: "//*[@id="listing-9770909"]/div[2]/a"
selector: "#listing-9770909 > div._v72lrv > a"

My attempts:

import requests
from bs4 import BeautifulSoup

url = 'https://www.airbnb.com/s/Denver--CO--United-States/homes'
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
divs = soup.find_all('div', attrs={'id': 'listing'})

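Attempt 2:

import requests
from lxml import html

# Same city page as in the first attempt
url = 'https://www.airbnb.com/s/Denver--CO--United-States/homes'
page = requests.get(url)
root = html.fromstring(page.content)
# XPath copied from Chrome for one specific listing
result = root.xpath('//div[@id="listing-9770909"]/div[2]/a')
for r in result:
    print(r)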

Neither of these returns anything. What I need to extract is the URL of the page link. Any ideas?

To extract the links, first make sure that the URLs actually exist in the page source. You can check by searching the page source for any of the listing ids (Ctrl+U opens the page source in Google Chrome and Mozilla Firefox). If the URLs are in the page source, you can scrape them directly by applying an XPath to the response text of the listing page. Here, the Airbnb listing page above does not have the links in its page source, so the page is probably loading them through requests to other endpoints (usually JSON requests). You can find those requests (for example in the Network tab of the browser's developer tools), send the same requests yourself, and get the required data from the responses. Please comment if you have any doubts about this.
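
The two checks described above can be sketched as follows. This is only a minimal illustration using the standard library: the helper names listing_ids_in_source and collect_values, the sample HTML strings, and the sample JSON structure (including the "url" key and the "/rooms/..." paths) are all made up for the example and are not Airbnb's actual markup or API response.

```python
import json
import re


def listing_ids_in_source(page_source):
    """Return listing ids found as id="listing-<digits>" in raw HTML text."""
    return re.findall(r'id="listing-(\d+)"', page_source)


def collect_values(node, key):
    """Recursively collect every value stored under `key` in parsed JSON."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            found.extend(collect_values(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_values(item, key))
    return found


# Hypothetical page source where the listing markup is present in the HTML.
static_html = (
    '<div id="listing-9770909"><div></div>'
    '<div><a href="/rooms/9770909">x</a></div></div>'
)
# Hypothetical page source where listings are filled in later by JavaScript.
dynamic_html = '<div id="site-content"></div><script src="/bundle.js"></script>'
# Hypothetical JSON payload, standing in for what a background request returns.
sample_json = json.loads(
    '{"results": [{"listing": {"id": 9770909, "url": "/rooms/9770909"}},'
    ' {"listing": {"id": 123, "url": "/rooms/123"}}]}'
)

print(listing_ids_in_source(static_html))   # ids present: XPath scraping can work
print(listing_ids_in_source(dynamic_html))  # empty: data is loaded elsewhere
print(collect_values(sample_json, "url"))
```

In real use, page_source would be the text of requests.get(url), and the JSON would come from whichever request you identify in the browser; the key to collect depends on the structure of that actual response.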
