
Python: scraping Expedia data with BeautifulSoup

I'm trying to scrape hotel data from Expedia, for example all the hotel links for Cavendish, Canada, from 01/01/2020 to 01/03/2020. The problem is that I can only scrape 20 of them, while each place actually has 200+ results. A sample page URL looks like this:

https://www.expedia.com/Hotel-Search?adults=2&destination=Cavendish%20Beach%2C%20Cavendish%2C%20Prince%20Edward%20Island%2C%20Canada&endDate=01%2F03%2F2020&latLong=46.504395%2C-63.439669&regionId=6261119&rooms=1&sort=RECOMMENDED&startDate=01%2F01%2F2020
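For reference, the query string of a search URL like the one above can be assembled from its parts instead of being hard-coded. The following is a minimal sketch using only the standard library; the helper name `build_search_url` and its argument names are hypothetical, and the parameter keys are simply the ones visible in the sample URL, not a documented Expedia interface.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://www.expedia.com/Hotel-Search"


def build_search_url(destination, start_date, end_date, lat_long, region_id,
                     adults=2, rooms=1, sort="RECOMMENDED"):
    """Build a search URL with the same parameter keys as the sample URL."""
    # Keys are listed in the same order as in the sample URL.
    params = {
        "adults": adults,
        "destination": destination,
        "endDate": end_date,
        "latLong": lat_long,
        "regionId": region_id,
        "rooms": rooms,
        "sort": sort,
        "startDate": start_date,
    }
    # quote_via=quote encodes spaces as %20 (the default quote_plus would use +).
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


url = build_search_url(
    destination="Cavendish Beach, Cavendish, Prince Edward Island, Canada",
    start_date="01/01/2020",
    end_date="01/03/2020",
    lat_long="46.504395,-63.439669",
    region_id=6261119,
)
print(url)
```

Building the URL this way makes it easier to loop over several destinations or date ranges, assuming the site keeps accepting the same parameters.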

Scraping code:

import lxml
import re
import requests
from bs4 import BeautifulSoup
import xlwt
import pandas as pd
import numpy as np

url = 'https://www.expedia.com/Hotel-Search?adults=2&destination=Cavendish%20Beach%2C%20Cavendish%2C%20Prince%20Edward%20Island%2C%20Canada&endDate=01%2F03%2F2020&latLong=46.504395%2C-63.439669&regionId=6261119&rooms=1&sort=RECOMMENDED&startDate=01%2F01%2F2020'

# Send a browser-like User-Agent with the request
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36'}

# Download the search results page
res = requests.get(url, headers=header)

# Parse the returned HTML with the lxml parser
soup = BeautifulSoup(res.content, 'lxml')

# Select every <a> that has both the listing__link and uitk-card-link classes
t1 = soup.select('a.listing__link.uitk-card-link')

Every link is stored in <a class='listing__link uitk-card-link' href=xxxxxxx> </a> inside an <li></li> element, and the HTML structure is the same for all of them, yet I only get the first 20. Can anyone explain this?
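To check how many matching links a given piece of HTML really contains, the extraction step can be tried on a saved copy of the page. The sketch below uses only the standard library so that it runs without extra packages. It is not the asker's code: the class `HotelLinkParser`, the function `extract_hotel_links`, and the sample markup are all hypothetical. It treats the `class` attribute as a space-separated list, which is the same rule the selector `a.listing__link.uitk-card-link` applies.

```python
from html.parser import HTMLParser


class HotelLinkParser(HTMLParser):
    """Collect href values of <a> tags that carry all required classes."""

    def __init__(self, required_classes):
        super().__init__()
        self.required = set(required_classes)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # The class attribute is a space-separated list of class names.
        classes = set((attr_map.get("class") or "").split())
        href = attr_map.get("href")
        if href and self.required <= classes:
            self.links.append(href)


def extract_hotel_links(html):
    """Return the hrefs of all anchors with both listing classes."""
    parser = HotelLinkParser(["listing__link", "uitk-card-link"])
    parser.feed(html)
    return parser.links


# Hypothetical sample markup standing in for one page of search results.
SAMPLE_HTML = """
<ul>
  <li><a class="listing__link uitk-card-link" href="/hotel-a">A</a></li>
  <li><a class="listing__link uitk-card-link" href="/hotel-b">B</a></li>
  <li><a class="other-link" href="/not-a-hotel">C</a></li>
</ul>
"""

links = extract_hotel_links(SAMPLE_HTML)
print(len(links), links)
```

Running the same function on the text of the downloaded page would show how many listings are actually present in the HTML that the server returned.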

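The page loads each next batch of 20 records through an API call, so those records are not in the HTML that requests downloads. There is no way to scrape the next 20 records from that page.

Here are the details of the API call that is made when you click "Show More":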

API LINK

These API calls require authentication before they return data.

Note: Scraping the HTML this way works only when the page makes no AJAX calls and uses no authentication method.
