
How do I scrape data using Selenium in Python from a webpage that adds divs on scroll?

I am trying to scrape data from the following webpage: https://skiplagged.com/flights/YTO/DXB/2020-08-21

The element I am trying to target is the following: div[@class='infinite-trip-list']//div[@class='span1 trip-duration']

This is a list that adds elements dynamically as the user scrolls. My goal is to store these elements in a variable so I can extract the duration of each flight. So far I have not been able to do that, and this is what I have tried after reading several Stack Overflow posts on similar issues.

import time
# "driver" is assumed to be an already-initialised Selenium WebDriver with the page loaded

mylist = []

last = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)  # let the page load
    new = driver.execute_script("return document.body.scrollHeight")
    infinite_list = driver.find_elements_by_xpath("//div[@class='infinite-trip-list']//div[@class='span1 trip-duration']")
    for elem in infinite_list:
        if elem.text not in mylist:  # compare the text, not the WebElement object
            mylist.append(elem.text)
    if new == last:  # height unchanged, so we have reached the end of the page
        break
    last = new  # remember the current height for the next pass

This scrolls the page all the way to the bottom, and as a result I only see the last 10 values appear. I have not been able to write code that scrolls and collects only the new divs (elements) as they are added.
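One possible workaround, shown only as a minimal sketch rather than a tested answer, is to scroll in smaller increments and harvest the visible rows after every step, so rows that the list later recycles are not lost. The step size, the variable names, and the assumption that driver is already on the page are mine, not from the question:

import time

# Sketch: scroll a fixed number of pixels at a time and collect the rows that
# are currently rendered before scrolling further. Assumes "driver" is an
# initialised Selenium WebDriver with the flights page already loaded.
durations = []
step = 500      # pixels per scroll step (assumed value)
position = 0

while True:
    # harvest whatever rows are rendered right now
    rows = driver.find_elements_by_xpath(
        "//div[@class='infinite-trip-list']//div[@class='span1 trip-duration']")
    for row in rows:
        text = row.text
        if text and text not in durations:
            durations.append(text)

    height = driver.execute_script("return document.body.scrollHeight")
    if position >= height:  # already scrolled past the bottom of the page
        break

    position += step
    driver.execute_script("window.scrollTo(0, arguments[0]);", position)
    time.sleep(1)  # give the list time to render newly added rows

print(durations)

Note that de-duplicating by text will merge flights that happen to share the same duration; pairing the duration with another field (such as the flight number) would avoid that.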

Try the below approach using the Requests API; it is fast, reliable, and needs less code to get the desired output. I have fetched the API URL from the website to GET the results based on the search criteria.

  1. First I created the dynamic URL. In the script below I declare six variables used to build the API URL; through these variables you can pass your search criteria, such as from, to, departure date, return date, and number of adults or children.
  2. After the URL is created, the requests method pings the API URL to get the data and converts the response to JSON.
  3. Finally, I fetch the flight numbers and use them to look up the details of each flight, such as prices, duration, and segments (basically hop details like flight number and airline name at the different airports, with their times).

You can fetch more details using the script below; right now it fetches prices, flight numbers, hop details at each airport, duration, etc.

import requests

def scrap_flights_details():

    from_source = 'YTO'
    to_destination = 'DXB'
    depart_date = '2020-08-21'
    return_date = ''
    counts_adults = 1
    counts_children = ''

    # build the dynamic search URL from the criteria above
    API_URL = 'https://skiplagged.com/api/search.php?from=' + str(from_source) + '&to=' + str(to_destination) + '&depart=' + str(depart_date) + '&'\
           'return=' + str(return_date) + '&format=v3&counts%5Badults%5D=' + str(counts_adults) + '&counts%5Bchildren%5D=' + str(counts_children)
    print('URL created: ', API_URL)
    flights_details = requests.get(API_URL, verify=False).json()

    # iterate over the outbound itineraries and print details for each flight number
    for flight_number in flights_details['itineraries']['outbound']:
        print('-' * 100)
        print('Flight Number : ', flight_number['flight'])
        print('Flight Price : ', flight_number['one_way_price'])
        number = flight_number['flight']
        print('Flight Details : ', flights_details['flights'][number])
        print('-' * 100)

scrap_flights_details()
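As a small, hedged variation on the script above (not part of the original answer), the query string could also be built by passing a params dict to requests, which handles the URL encoding of keys such as counts[adults] automatically:

import requests

# Hypothetical variation: let requests encode the query string instead of
# concatenating it by hand. Parameter names mirror the URL used above.
params = {
    "from": "YTO",
    "to": "DXB",
    "depart": "2020-08-21",
    "return": "",
    "format": "v3",
    "counts[adults]": 1,
    "counts[children]": "",
}
response = requests.get("https://skiplagged.com/api/search.php", params=params, verify=False)
flights_details = response.json()
print(list(flights_details.keys()))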
