
Trying to grab href URLs from Kayak website using BeautifulSoup

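I am trying to grab the URL from each card that appears on this Kayak page. When I run the code below, I get a BrokenPipeError: [Errno 32] Broken pipe error. Can someone help me with the correct code to grab all the URLs from the flight results on this page?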

url = 'https://www.kayak.com/flights/AMS-WMI,nearby/2023-02-15/WMI-SOF,nearby/2023-02-18/SOF-BEG,nearby/2023-02-20/BEG-MIL,nearby/2023-02-23/MIL-AMS,nearby/2023-02-25/?sort=bestflight_a&fs=stops=-2&attempt=1&lastms=1675195877028'
requests = 0

chrome_options = webdriver.ChromeOptions()
agents = ["Firefox/66.0.3","Chrome/73.0.3683.68","Edge/16.16299"]
print("User agent: " + agents[(requests%len(agents))])
chrome_options.add_argument('--user-agent=' + agents[(requests%len(agents))] + '"')    
chrome_options.add_experimental_option('useAutomationExtension', False)

driver = webdriver.Chrome('/Users/junerodriguez/Downloads/chromedriver_mac_arm64/chromedriver')
driver.implicitly_wait(10)
driver.get(url)
sleep(randint(8,10))

xp_hrefs = "//div[@class='above-button']//a[contains(@class,'booking-link')]/href[@class='col col-best']"

hrefs = driver.find_elements_by_xpath(xp_hrefs)
hrefs


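In Selenium, an XPath must locate web elements, not their attributes.
To extract the href attribute values, collect all the matching a elements into a list, then iterate over that list and read the href attribute from each element, as follows: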

hrefs = [link.get_attribute('href') for link in driver.find_elements(By.XPATH,"//div[@class='above-button']//a[contains(@class,'booking-link')]")]

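In the code above, driver.find_elements() returns a list of all matching web elements, and link.get_attribute('href') is applied to each link element in that list to extract its href attribute value.
The results are collected in the hrefs list.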
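
As a minimal sketch of that extraction step, the helper below takes any elements that expose get_attribute() (for example, the list returned by driver.find_elements) and returns their href values. Skipping empty values and duplicates is an extra assumption, not something the answer requires. The names extract_hrefs and FakeLink are hypothetical and not part of Selenium; FakeLink only stands in for a WebElement so the example runs without a browser.

```python
from typing import Iterable, List, Optional, Protocol


class HasAttributes(Protocol):
    """Anything exposing get_attribute(name), such as a Selenium WebElement."""

    def get_attribute(self, name: str) -> Optional[str]: ...


def extract_hrefs(elements: Iterable[HasAttributes]) -> List[str]:
    """Return the unique, non-empty href values of the elements, in their original order."""
    seen = set()
    hrefs = []
    for element in elements:
        href = element.get_attribute("href")
        # Skip elements without an href and URLs that were already collected.
        if href and href not in seen:
            seen.add(href)
            hrefs.append(href)
    return hrefs


class FakeLink:
    """Hypothetical stand-in for a WebElement (illustration only)."""

    def __init__(self, href):
        self._href = href

    def get_attribute(self, name):
        return self._href if name == "href" else None


links = [
    FakeLink("https://example.com/a"),
    FakeLink(None),
    FakeLink("https://example.com/a"),
    FakeLink("https://example.com/b"),
]
print(extract_hrefs(links))  # ['https://example.com/a', 'https://example.com/b']
```

With a real driver, the same function could be called on the result of driver.find_elements(By.XPATH, ...) instead of the fake links.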

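To extract the links from all the href attributes on the page, you need to induce WebDriverWait for visibility_of_all_elements_located(), and you can use either of the following locator strategies: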

  • Using CSS_SELECTOR:

     driver.get("https://www.kayak.com/flights/AMS-WMI,nearby/2023-02-15/WMI-SOF,nearby/2023-02-18/SOF-BEG,nearby/2023-02-20/BEG-MIL,nearby/2023-02-23/MIL-AMS,nearby/2023-02-25/?sort=bestflight_a&fs=stops=-2&attempt=1&lastms=1675195877028")
     print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "a[role='link'][href]")))])
  • Using XPATH:

     driver.get("https://www.kayak.com/flights/AMS-WMI,nearby/2023-02-15/WMI-SOF,nearby/2023-02-18/SOF-BEG,nearby/2023-02-20/BEG-MIL,nearby/2023-02-23/MIL-AMS,nearby/2023-02-25/?sort=bestflight_a&fs=stops=-2&attempt=1&lastms=1675195877028")
     print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//a[@role='link' and @href]")))])
  • Console output:

     ['https://www.kayak.com/book/flight?code=OiFir3l_8L.18wgSzIaLAlgpzrTH2pViLaYAeeTFjgE.81197.36c89f7717e84ac7a4ee2898627fa251&h=40d03211086c&sub=E-191e8b4083a&pageOrigin=F..RP.FE.M0', 'https://www.kayak.com/book/flight?code=OiFir3l_8L.47F3EeHCWiIEdn9PX-8xhQ.41000.a6f675f0a632a9d55b0fab7f1b09f9d8&h=8dce29003385&sub=E-10f42a14593&pageOrigin=F..RP.FE.M1', 'https://www.kayak.com/book/flight?code=OiFir3l_8L.18wgSzIaLAlgpzrTH2pViLaYAeeTFjgE.81397.58fb639ccf8938f61eec808f1e13c556&h=ba02be2bf0dc&sub=E-191e8b4083a&pageOrigin=F..RP.FE.M2', 'https://www.kayak.com/book/flight?code=OiFir3l_8L.18wgSzIaLAlgpzrTH2pViLaYAeeTFjgE.81197.aca9104db06bae99e4f55a158dfd3ff2&h=61a4dc653dc3&sub=E-191e8b4083a&pageOrigin=F..RP.FE.M4', 'https://www.kayak.com/book/flight?code=OiFir3l_8L.18wgSzIaLAlgpzrTH2pViLaYAeeTFjgE.81397.bcc92e8ae656b0e298dbe8a6555bd825&h=ece97a1b9509&sub=E-191e8b4083a&pageOrigin=F..RP.FE.M5', 'https://www.kayak.com/book/flight?code=OiFir3l_8L.18wgSzIaLAlgpzrTH2pViLaYAeeTFjgE.80697.732461bd95055d2478850abf1741221f&h=c94d3b283c0a&sub=E-191e8b4083a&pageOrigin=F..RP.FE.M6', 'https://www.kayak.com/book/flight?code=OiFir3l_8L.18wgSzIaLAlgpzrTH2pViLaYAeeTFjgE.80997.215cabd8ee10582a0d6b94c20dfb95ad&h=b493996d9e9d&sub=E-191e8b4083a&pageOrigin=F..RP.FE.M7', 'https://www.kayak.com/book/flight?code=OiFir3l_8L.18wgSzIaLAlgpzrTH2pViLaYAeeTFjgE.80697.cabd6f6051c17b3cd7f9129454607d0e&h=917f7fb0f2f5&sub=E-191e8b4083a&pageOrigin=F..RP.FE.M8', 'https://www.kayak.com/book/flight?code=OiFir3l_8L.18wgSzIaLAlgpzrTH2pViLaYAeeTFjgE.80997.07496f1f93e916d757ec284da1ef4638&h=52abbdb7d13a&sub=E-191e8b4083a&pageOrigin=F..RP.FE.M11', 'https://www.kayak.com/book/flight?code=OiFir3l_8L.eNCwACMVOeJpd4CyPwn0EI6M4XD8KcmF.71697.a1633218d7cbd5eb2fe950504a6207a9&h=c8fe9769a628&sub=E-15b10c5af5f&pageOrigin=F..RP.FE.M12', 'https://www.kayak.com/book/flight?code=OiFir3l_8L.eNCwACMVOeJpd4CyPwn0EI6M4XD8KcmF.71997.ebd834f5c265ae428e1bdbb3637a606b&h=d80290f43a93&sub=E-15b10c5af5f&pageOrigin=F..RP.FE.M13', 
'https://www.kayak.com/book/flight?code=OiFir3l_8L.eNCwACMVOeJpd4CyPwn0EI6M4XD8KcmF.71697.c7a2ace471ba5c35334014e91956f849&h=2f3b292c166e&sub=E-15b10c5af5f&pageOrigin=F..RP.FE.M14', 'https://www.kayak.com/book/flight?code=OiFir3l_8L.eNCwACMVOeJpd4CyPwn0EI6M4XD8KcmF.71997.a281f919b379469a223fb34ed5510409&h=913a810b8e80&sub=E-15b10c5af5f&pageOrigin=F..RP.FE.M15', 'https://www.kayak.com/book/flight?code=OiFir3l_8L.eNCwACMVOeJpd4CyPwn0EI6M4XD8KcmF.71697.4fdb4fded43ccbf47dcdcad01bf919e6&h=ea07410d1dda&sub=E-15b10c5af5f&pageOrigin=F..RP.FE.M16', 'https://www.kayak.com/book/flight?code=OiFir3l_8L.eNCwACMVOeJpd4CyPwn0EI6M4XD8KcmF.71997.23101dac562249519c55956ba4cc7abf&h=45c62f765f1f&sub=E-15b10c5af5f&pageOrigin=F..RP.FE.M17', 'https://www.kayak.com/book/flight?code=OiFir3l_8L.eNCwACMVOeJpd4CyPwn0EI6M4XD8KcmF.71697.1377d5650be1523cc39b1849b7d9bbdf&h=c04d73c3ac61&sub=E-15b10c5af5f&pageOrigin=F..RP.FE.M18']
  • Note: You have to add the following imports:

     from selenium.webdriver.support.ui import WebDriverWait
     from selenium.webdriver.common.by import By
     from selenium.webdriver.support import expected_conditions as EC
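
For readers unfamiliar with explicit waits, the idea behind WebDriverWait(driver, 20).until(...) is polling: a condition is called repeatedly until it returns something truthy or the timeout runs out. The sketch below illustrates that idea in plain Python. It is not Selenium's implementation, and wait_until, links_ready, and attempts are hypothetical names used only for this illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def wait_until(condition: Callable[[], T], timeout: float = 20.0, poll: float = 0.5) -> T:
    """Call condition() repeatedly until it returns a truthy value or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


# Hypothetical page whose links only become available on the third poll.
attempts = {"count": 0}


def links_ready():
    attempts["count"] += 1
    return ["https://example.com/a"] if attempts["count"] >= 3 else []


print(wait_until(links_ready, timeout=2.0, poll=0.01))  # ['https://example.com/a']
```

This is why an explicit wait is usually preferable to a fixed sleep: it returns as soon as the elements are available and fails with a clear timeout error otherwise.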
