How do I scrape data using Selenium in Python from a webpage that adds divs on scroll?
I am trying to scrape data from the following webpage: https://skiplagged.com/flights/YTO/DXB/2020-08-21
The element I am trying to target is the following: div[@class='infinite-trip-list']//div[@class='span1 trip-duration']
This is a list that adds elements dynamically as the user scrolls. My goal is to store these elements in a variable so that I can extract the duration of each flight. So far I have not been able to do that; this is what I tried after reading several Stack Overflow posts on similar issues:
    import time
    from selenium.webdriver.common.by import By

    # `driver` is assumed to be a Selenium WebDriver that has already loaded the page.
    mylist = []
    last = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(1)  # let the page load
        new = driver.execute_script("return document.body.scrollHeight")
        infinite_list = driver.find_elements(By.XPATH, "//div[@class='infinite-trip-list']//div[@class='span1 trip-duration']")
        for elem in infinite_list:
            text = elem.text
            if text not in mylist:  # mylist holds strings, so compare text with text
                mylist.append(text)
        if new == last:  # the height did not change, so we have reached the end of the page
            break
        last = new  # remember the current height for the next comparison
This scrolls the page straight to the bottom, and as a result I only see the last 10 values. I have not been able to write code that scrolls and adds only the new divs (elements) as they appear.
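For reference, the "scroll and keep only the new rows" idea could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a tested solution for this page: the helper names `collect_while_scrolling` and `make_selenium_callbacks` are hypothetical, and keying each row by Selenium's internal element id plus its text is only a heuristic for telling rows apart. The idea is to scroll one viewport at a time instead of jumping to the bottom, and to read whatever is rendered after every step.

```python
import time
from typing import Callable, Dict, Hashable, Iterable, List, Tuple


def collect_while_scrolling(
    read_visible: Callable[[], Iterable[Tuple[Hashable, str]]],
    scroll_step: Callable[[], bool],
    pause: float = 1.0,
    max_steps: int = 500,
) -> List[str]:
    """Collect values from a list that keeps only some rows in the DOM.

    read_visible() returns (key, value) pairs for the rows currently rendered.
    scroll_step() scrolls a little further and returns False once the page
    did not move, i.e. the bottom has been reached.
    """
    seen: Dict[Hashable, str] = {}  # dicts preserve insertion order
    for _ in range(max_steps):
        for key, value in read_visible():
            seen.setdefault(key, value)  # keep the first sighting of each row
        if not scroll_step():
            break
        time.sleep(pause)  # give the page time to render the next rows
    return list(seen.values())


def make_selenium_callbacks(driver, xpath: str):
    """Hypothetical wiring for a Selenium driver (imports are done lazily)."""
    from selenium.common.exceptions import StaleElementReferenceException
    from selenium.webdriver.common.by import By

    def read_visible():
        pairs = []
        for elem in driver.find_elements(By.XPATH, xpath):
            try:
                text = elem.text
            except StaleElementReferenceException:
                continue  # the row was removed from the DOM while we were reading
            # Assumption: (internal element id, text) identifies a row well enough.
            pairs.append(((elem.id, text), text))
        return pairs

    def scroll_step():
        before = driver.execute_script("return window.pageYOffset;")
        driver.execute_script("window.scrollBy(0, window.innerHeight);")
        after = driver.execute_script("return window.pageYOffset;")
        return after > before  # False when the page could not scroll further

    return read_visible, scroll_step
```

With a loaded `driver`, usage would look like `read, step = make_selenium_callbacks(driver, "//div[@class='infinite-trip-list']//div[@class='span1 trip-duration']")` followed by `durations = collect_while_scrolling(read, step)`. If the page needs longer than `pause` seconds to load more rows, the loop can stop early, so the pause may need tuning.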
Try the approach below, which uses the requests library to call the site's API directly: it is fast and reliable, and it needs less code to get the desired output. I took the API URL from the website; it returns the search results in response to a GET request.
The script below currently fetches prices, flight numbers, hop details at each airport, duration and so on, and you can extend it to fetch more details.
    import requests

    def scrap_flights_details():
        from_source = 'YTO'
        to_destination = 'DXB'
        depart_date = '2020-08-21'
        return_date = ''
        counts_adults = 1
        counts_children = ''
        # Build the search URL; %5B and %5D are the URL-encoded "[" and "]".
        API_URL = (
            'https://skiplagged.com/api/search.php'
            f'?from={from_source}&to={to_destination}&depart={depart_date}'
            f'&return={return_date}&format=v3'
            f'&counts%5Badults%5D={counts_adults}&counts%5Bchildren%5D={counts_children}'
        )
        print('URL created: ', API_URL)
        # verify=False skips TLS certificate verification; drop it if the
        # certificate validates on your machine.
        flights_details = requests.get(API_URL, verify=False).json()
        for itinerary in flights_details['itineraries']['outbound']:
            print('-' * 100)
            number = itinerary['flight']
            print('Flight Number : ', number)
            print('Flight Price : ', itinerary['one_way_price'])
            # Per-flight details are stored separately, keyed by flight number.
            print('Flight Details : ', flights_details['flights'][number])
            print('-' * 100)

    scrap_flights_details()
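If you would rather keep the results in a variable than print them, the same traversal could be wrapped in a small helper along these lines. This is only a sketch: the function name `outbound_summary` is hypothetical, and it relies solely on the keys used in the script above (`itineraries`, `outbound`, `flight`, `one_way_price`, `flights`). The exact field that holds the duration inside each details entry is not shown here, so inspect one entry before relying on a particular key.

```python
from typing import Any, Dict, List, Tuple


def outbound_summary(data: Dict[str, Any]) -> List[Tuple[str, Any, Any]]:
    """Return (flight number, one-way price, details) for each outbound itinerary.

    `data` is the parsed JSON response of the search API used above.
    """
    flights = data.get("flights", {})
    rows = []
    for itinerary in data["itineraries"]["outbound"]:
        number = itinerary["flight"]
        # Details are looked up by flight number; None if the entry is missing.
        rows.append((number, itinerary["one_way_price"], flights.get(number)))
    return rows
```

For example, `rows = outbound_summary(requests.get(API_URL).json())` gives a list you can loop over, and `rows[0][2]` shows what one details entry contains.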