
How to grab URL in "View Deal" and price for deal from kayak.com using BeautifulSoup

I have a list of Kayak URLs and I'd like to grab the price and the link in "View Deal" for the "Best" and "Cheapest" HTML cards, essentially the first two results, since I've already sorted the results in the URLs (here's an example of a URL).

I haven't managed to grab these bits of data using BeautifulSoup and I could use some help! Here's what I've tried for pulling the price info, but I'm getting an empty prices_list variable. Below is a screenshot of exactly what I'd like to pull from the website.

from random import randint
from time import sleep

from selenium import webdriver

url = "https://www.kayak.com/flights/AMS-WMI,nearby/2023-02-15/WMI-SOF,nearby/2023-02-18/SOF-BEG,nearby/2023-02-20/BEG-MIL,nearby/2023-02-23/MIL-AMS,nearby/2023-02-25/?sort=bestflight_a"
requests = 0  # request counter, used to rotate through the user agents

chrome_options = webdriver.ChromeOptions()
agents = ["Firefox/66.0.3","Chrome/73.0.3683.68","Edge/16.16299"]
print("User agent: " + agents[(requests%len(agents))])
chrome_options.add_argument('--user-agent=' + agents[(requests%len(agents))])
chrome_options.add_experimental_option('useAutomationExtension', False)

# pass the options to the driver, otherwise they are never applied (Selenium 3 style call)
driver = webdriver.Chrome('/Users/etc./etc.', options=chrome_options)
driver.implicitly_wait(10)
driver.get(url)

# getting the prices
sleep(randint(8,10))
xp_prices = '//a[@class="booking-link"]/span[@class="price option-text"]'
prices = driver.find_elements_by_xpath(xp_prices)
prices_list = [price.text.replace('$','') for price in prices if price.text != '']
prices_list = list(map(int, prices_list))

[Screenshot: enter image description here]

There are 2 problems with the XPath locator:

  1. The class attribute of the a element is not exactly booking-link but "booking-link " with a trailing space, so the exact match @class="booking-link" fails.
  2. The locator also matches duplicate, irrelevant (invisible) elements, so it needs to be scoped more narrowly.

The following locator works:

"//div[@class='above-button']//a[contains(@class,'booking-link')]/span[@class='price option-text']"

So, the relevant code line could be:

xp_prices = "//div[@class='above-button']//a[contains(@class,'booking-link')]/span[@class='price option-text']"
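The difference between an exact @class comparison and a looser class check can be reproduced offline. The sketch below uses a hypothetical, heavily reduced piece of markup (not Kayak's real HTML) and the standard library's ElementTree, which supports exact attribute predicates but not contains(), so the looser check is done in Python as a whole-token match instead:

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, reduced to the one detail that matters here:
# the class attribute of the <a> element ends with a space.
SNIPPET = (
    "<div class='above-button'>"
    "<a class='booking-link '>"
    "<span class='price option-text'>$807</span>"
    "</a>"
    "</div>"
)

root = ET.fromstring(SNIPPET)

# Exact comparison: 'booking-link ' is not equal to 'booking-link',
# so this predicate matches nothing.
exact_matches = root.findall(".//a[@class='booking-link']")


def has_class(element, name):
    """Return True if `name` is one of the whitespace-separated class tokens."""
    return name in element.get("class", "").split()


# Token comparison: surrounding whitespace in the attribute no longer matters.
token_matches = [a for a in root.iter("a") if has_class(a, "booking-link")]

print(len(exact_matches))
print(len(token_matches))
print(token_matches[0].find("span").text)
```

Note that XPath's contains(@class, 'booking-link') is a substring test, so it would also match a class such as booking-link-extra; the token check above is slightly stricter.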

To extract the prices from View Deal for the Best and Cheapest sections of the page, you have to induce WebDriverWait for visibility_of_element_located(), and you can use either of the following locator strategies:

  • From the Best section:

     driver.get("https://www.kayak.com/flights/AMS-WMI,nearby/2023-02-15/WMI-SOF,nearby/2023-02-18/SOF-BEG,nearby/2023-02-20/BEG-MIL,nearby/2023-02-23/MIL-AMS,nearby/2023-02-25/?sort=bestflight_a")
     print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[text()='Best']//following::div[contains(@class, 'bottom-booking')]//a//div[contains(@class, 'price-text')]"))).text)
  • Console output:

     $807
  • From the Cheapest section:

     driver.get("https://www.kayak.com/flights/AMS-WMI,nearby/2023-02-15/WMI-SOF,nearby/2023-02-18/SOF-BEG,nearby/2023-02-20/BEG-MIL,nearby/2023-02-23/MIL-AMS,nearby/2023-02-25/?sort=bestflight_a")
     print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[text()='Cheapest']//following::div[contains(@class, 'bottom-booking')]//a//div[contains(@class, 'price-text')]"))).text)
  • Console output:

     $410
  • Note: You have to add the following imports:

     from selenium.webdriver.support.ui import WebDriverWait
     from selenium.webdriver.common.by import By
     from selenium.webdriver.support import expected_conditions as EC
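The two snippets above differ only in the section label, so they can be folded into one helper. This is a minimal sketch, not a tested scraper: the function names section_price_xpath, parse_price and fetch_section_prices are made up for illustration, the XPath is copied from the answer, and the price cleanup assumes whole-dollar amounts like the $807 and $410 shown above.

```python
import re


def section_price_xpath(label):
    """Build the price locator from the answer for a section label such as 'Best'."""
    return (
        f"//div[text()='{label}']"
        "//following::div[contains(@class, 'bottom-booking')]"
        "//a//div[contains(@class, 'price-text')]"
    )


def parse_price(text):
    """Turn price text such as '$807' or '$1,410' into an int.

    Assumes whole-dollar amounts; returns None when the text has no digits.
    """
    digits = re.sub(r"[^\d]", "", text)
    return int(digits) if digits else None


def fetch_section_prices(driver, labels=("Best", "Cheapest"), timeout=20):
    """Wait for each section's price to become visible and return {label: price}."""
    # Selenium is imported here so the two pure helpers above can be used
    # (and checked) on a machine without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    wait = WebDriverWait(driver, timeout)
    prices = {}
    for label in labels:
        element = wait.until(
            EC.visibility_of_element_located((By.XPATH, section_price_xpath(label)))
        )
        prices[label] = parse_price(element.text)
    return prices
```

After driver.get(url), calling fetch_section_prices(driver) would return a dictionary keyed by "Best" and "Cheapest", provided the page still uses the markup the locators assume.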
