
Python: scraping Expedia data with BeautifulSoup

I'm trying to scrape hotel data from Expedia. For example, I want to scrape all the hotel links for Cavendish, Canada, from 01/01/2020 to 01/03/2020. The problem is that I can only scrape 20 of them, while each place actually contains 200+. A sample webpage URL looks like this:

https://www.expedia.com/Hotel-Search?adults=2&destination=Cavendish%20Beach%2C%20Cavendish%2C%20Prince%20Edward%20Island%2C%20Canada&endDate=01%2F03%2F2020&latLong=46.504395%2C-63.439669&regionId=6261119&rooms=1&sort=RECOMMENDED&startDate=01%2F01%2F2020

Scraping code:

import lxml
import re
import requests
from bs4 import BeautifulSoup
import xlwt
import pandas as pd
import numpy as np

url = 'https://www.expedia.com/Hotel-Search?adults=2&destination=Cavendish%20Beach%2C%20Cavendish%2C%20Prince%20Edward%20Island%2C%20Canada&endDate=01%2F03%2F2020&latLong=46.504395%2C-63.439669&regionId=6261119&rooms=1&sort=RECOMMENDED&startDate=01%2F01%2F2020'

# Send a browser-like User-Agent header with the request
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36'}

res = requests.get(url, headers=header)

# Parse the returned HTML with the lxml parser
soup = BeautifulSoup(res.content, 'lxml')

# Select every <a> tag that carries both classes
t1 = soup.select('a.listing__link.uitk-card-link')

So every link is stored in <a class='listing__link uitk-card-link' href=xxxxxxx> </a> inside <li></li>, and there is no difference in the HTML structure between them. Can anyone explain this?
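One way to check how many listing links are actually present in the HTML the server returns is to count the `<a>` tags that carry both classes. The following is only a minimal sketch using the standard library on made-up sample markup: `SAMPLE_HTML`, `ListingLinkParser`, and `extract_listing_links` are hypothetical names for illustration, not part of the original script.

```python
from html.parser import HTMLParser


class ListingLinkParser(HTMLParser):
    """Collect href values of <a> tags that carry every required class."""

    REQUIRED = {"listing__link", "uitk-card-link"}

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # The class attribute is a space-separated list of class names
        classes = set((attr_map.get("class") or "").split())
        href = attr_map.get("href")
        if href and self.REQUIRED <= classes:
            self.links.append(href)


# Hypothetical sample markup, not real Expedia output
SAMPLE_HTML = """
<ul>
  <li><a class="listing__link uitk-card-link" href="/hotel-a">A</a></li>
  <li><a class="listing__link uitk-card-link" href="/hotel-b">B</a></li>
  <li><a class="uitk-card-link" href="/not-a-listing">C</a></li>
</ul>
"""


def extract_listing_links(html):
    """Return the hrefs of all links that have both listing classes."""
    parser = ListingLinkParser()
    parser.feed(html)
    return parser.links


links = extract_listing_links(SAMPLE_HTML)
print(len(links), links)
```

With the script above, the same count would presumably come from `len(t1)`, and the URLs from `[a.get('href') for a in t1]`. If that count is 20, the other listings are simply not in the downloaded HTML.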

They use an API call to load the next 20 records, so only the first 20 are in the HTML you download. There is no way to scrape the next 20 records from that page.

Here are the details of the API call they make when you click "Show More":

API LINK

The API requires authentication before it returns any data.

Note: Scraping the HTML like this works only when the page does not load its data through AJAX calls and does not require authentication.
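To illustrate what "Show More" does conceptually, each click requests the next batch of 20 records at an increasing offset. The sketch below assumes a hypothetical `fetch_page(offset, size)` function and uses a fake in-memory data source. It does not call Expedia, and it does not get around the authentication mentioned above.

```python
def collect_all(fetch_page, page_size=20):
    """Request batches at increasing offsets until a short batch arrives."""
    results = []
    offset = 0
    while True:
        batch = fetch_page(offset, page_size)
        results.extend(batch)
        # A batch smaller than page_size means there is nothing left to load
        if len(batch) < page_size:
            break
        offset += page_size
    return results


# Hypothetical stand-in for the paginated API: 45 made-up listing links
FAKE_LISTINGS = [f"/hotel-{i}" for i in range(45)]


def fake_fetch_page(offset, size):
    """Return one slice of the fake data, the way a paginated API might."""
    return FAKE_LISTINGS[offset:offset + size]


all_links = collect_all(fake_fetch_page)
print(len(all_links))
```

A plain `requests.get` on the search URL corresponds to the first batch only, which is why the scraper stops at 20 links.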
