
How do I scrape data using Selenium in Python from a webpage that adds div on scroll?

I am trying to scrape data from the following webpage: https://skiplagged.com/flights/YTO/DXB/2020-08-21 .

The element I am trying to target is the following: div[@class='infinite-trip-list']//div[@class='span1 trip-duration']

This is a list that adds elements dynamically as the user scrolls. My goal is to store these elements in a variable so that I can extract the duration of each flight. So far I have not been able to do that; this is what I tried after reading several Stack Overflow posts on similar issues.

mylist = []

last = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1) #let the page load
    new = driver.execute_script("return document.body.scrollHeight")
    infinite_list = driver.find_elements_by_xpath("//div[@class='infinite-trip-list']//div[@class='span1 trip-duration']")
    for elem in infinite_list:
        if elem not in mylist:
            mylist.append(elem.text)
    if new == last: #if new height is equal to last height then we have reached the end of the page so end the while loop
        break
    last = new #changing last's value to new

This scrolls the page all the way to the bottom, and as a result I only see the last 10 values. I have not been able to write code that scrolls and adds only the new divs (elements) as they appear.
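One way to stay with Selenium is to scroll in small steps and record the rows at every step, before the page drops them from the DOM again. The following is only a sketch under assumptions, not a tested solution for this site: the step size, the pause, the helper names, and the choice of the parent row's text as a de-duplication key are all hypothetical. Keying on the duration text alone would merge different flights that happen to have the same duration.

```python
import time

# CSS equivalent of the XPath in the question.
ROW_CSS = ".infinite-trip-list .span1.trip-duration"


def collect_incrementally(read_visible, scroll_step, max_idle_rounds=3, pause=1.0):
    """Collect values from a list that only keeps a window of rows in the DOM.

    read_visible() returns (key, value) pairs for the rows currently present.
    scroll_step() scrolls the page by one small step.
    Stops after max_idle_rounds consecutive rounds without any new key.
    """
    seen = {}  # dicts keep insertion order, so values stay in page order
    idle = 0
    while idle < max_idle_rounds:
        before = len(seen)
        for key, value in read_visible():
            seen.setdefault(key, value)
        idle = idle + 1 if len(seen) == before else 0
        scroll_step()
        time.sleep(pause)  # give the page time to render the next rows
    return list(seen.values())


def scrape_durations(driver, step_px=600, pause=1.0):
    """Hypothetical wiring for a Selenium WebDriver instance."""
    def read_visible():
        # Read key and text in one script call to avoid stale element references.
        # Assumption: the parent of each duration cell is the whole trip row.
        pairs = driver.execute_script(
            "return Array.from(document.querySelectorAll(arguments[0]))"
            ".map(e => [e.parentElement.innerText, e.innerText]);",
            ROW_CSS,
        )
        return [tuple(pair) for pair in pairs]

    def scroll_step():
        driver.execute_script("window.scrollBy(0, arguments[0]);", step_px)

    return collect_incrementally(read_visible, scroll_step, pause=pause)
```

With an open driver on the results page, `scrape_durations(driver)` would return the duration strings in the order the rows were first seen. The step size should be smaller than the height of the rows kept in the DOM, otherwise rows can be skipped.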

Try the approach below, which uses the requests library to call the site's API directly. It is fast and reliable, and it needs less code to get the desired output. I took the API URL from the website; a GET request to it returns the results for a given search.

  1. First, I build the URL dynamically. The script below declares six variables that are used to create the API URL; through them you can pass your search criteria: origin, destination, departure date, return date, and the number of adults and children.
  2. Once the URL is built, requests.get calls the API URL and the response is parsed as JSON.
  3. Finally, I loop over the flight numbers and look up the details for each one, such as price, duration and segments (essentially the hop details: flight number and airline name at each airport, with times).
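As a side note on step 1, the query string can also be assembled with the standard library instead of manual concatenation. This is a minimal sketch; the function name `build_search_url` is made up for illustration, while the parameter names are copied from the URL used in the script.

```python
from urllib.parse import urlencode

API_BASE = "https://skiplagged.com/api/search.php"


def build_search_url(from_source, to_destination, depart_date,
                     return_date="", counts_adults=1, counts_children=""):
    """Build the search URL; urlencode escapes the brackets as %5B and %5D."""
    params = {
        "from": from_source,
        "to": to_destination,
        "depart": depart_date,
        "return": return_date,
        "format": "v3",
        "counts[adults]": counts_adults,
        "counts[children]": counts_children,
    }
    return API_BASE + "?" + urlencode(params)


print(build_search_url("YTO", "DXB", "2020-08-21"))
```

For the values used in this answer, this produces the same URL as the concatenation in the script.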

You can fetch more details with the script below; right now it prints the price, flight number, hop details at each airport, duration, etc.

import requests


def scrap_flights_details():
    # Search criteria used to build the API URL.
    from_source = 'YTO'
    to_destination = 'DXB'
    depart_date = '2020-08-21'
    return_date = ''
    counts_adults = 1
    counts_children = ''

    # Build the search API URL from the criteria above.
    API_URL = ('https://skiplagged.com/api/search.php?from=' + str(from_source)
               + '&to=' + str(to_destination)
               + '&depart=' + str(depart_date)
               + '&return=' + str(return_date)
               + '&format=v3&counts%5Badults%5D=' + str(counts_adults)
               + '&counts%5Bchildren%5D=' + str(counts_children))
    print('URL created: ', API_URL)

    # Call the API and parse the JSON response.
    # Note: verify=False disables TLS certificate verification.
    flights_details = requests.get(API_URL, verify=False).json()

    # Each outbound itinerary references a flight by its ID;
    # the full details for that ID are stored under 'flights'.
    for flight_number in flights_details['itineraries']['outbound']:
        print('-' * 100)
        print('Flight Number : ', flight_number['flight'])
        print('Flight Price : ', flight_number['one_way_price'])
        number = flight_number['flight']
        print('Flight Details : ', flights_details['flights'][number])
        print('-' * 100)


scrap_flights_details()
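Since the question asks specifically for the duration of each flight, the same response can be reduced to just that field. The sketch below rests on assumptions: that each entry under 'flights' is a dictionary, and that the duration is stored under a key named 'duration'. The key name is a guess and should be checked against the printed 'Flight Details' output. The sample data is made up purely to show the shape the code expects.

```python
def extract_durations(flights_details, duration_key="duration"):
    """Map each outbound flight ID to its duration value (None if the key is missing)."""
    durations = {}
    for itinerary in flights_details["itineraries"]["outbound"]:
        flight_id = itinerary["flight"]
        details = flights_details["flights"].get(flight_id, {})
        durations[flight_id] = details.get(duration_key)
    return durations


# Made-up sample with the assumed structure, not real API output.
sample = {
    "itineraries": {"outbound": [{"flight": "id-1"}, {"flight": "id-2"}]},
    "flights": {"id-1": {"duration": 100}, "id-2": {}},
}
print(extract_durations(sample))
```

With the real response, `extract_durations(flights_details)` could be called right after the `requests.get(...).json()` line.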


 