Scraping a dynamically loaded href attribute with Selenium and BeautifulSoup4

Question

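I am trying to scrape a dynamically loaded href attribute with Selenium and BeautifulSoup4.

When I view the page source, the href attribute is empty, but when I click Inspect Element, the href attribute contains a link. This means the href attribute is loaded dynamically. How can I extract that link?

I am trying the following code: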

def Scrape_Udemy():
    driver.get('https://couponscorpion.com/marketing/complete-guide-to-pinterest-pinterest-growth-2020/')
    content = driver.page_source
    soup = BeautifulSoup(content, 'html.parser')
    course_link = soup.find_all('div',{'class':"rh_button_wrapper"})
    for i in course_link:
        link = i.find('a',href=True)
        if link is None:
            print('No Links Found')
        print(link['href'])

But when I run the function, it prints []. I am using the Chrome driver. How can I fix this? I want to get the free coupon code link from the URL https://couponscorpion.com/marketing/complete-guide-to-pinterest-pinterest-growth-2020/

Answer 1

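Two things

Before you get the page source, there is a box with an Allow button that needs to be clicked.
Your link is the direct child of a span, not a div.

Code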

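import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

# Selenium 4 style: the chromedriver path is passed through a Service object
driver = webdriver.Chrome(service=Service(r'c:\users\aaron\chromedriver.exe'))
driver.get('https://couponscorpion.com/marketing/complete-guide-to-pinterest-pinterest-growth-2020/')
time.sleep(5)  # crude wait so the page and the permission box can load
# Click the button on the permission box before reading the page source
driver.find_element(By.XPATH, '//button[@class="align-right primary slidedown-button"]').click()
content = driver.page_source
soup = BeautifulSoup(content, 'html.parser')
# The link sits inside a span with class rh_button_wrapper, not a div
course_link = soup.find_all('span', {'class': "rh_button_wrapper"})
for i in course_link:
    link = i.find('a', href=True)
    if link is None:
        print('No Links Found')
    else:
        print(link['href'])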

Output

https://couponscorpion.com/scripts/udemy/out.php?go=Q25aTzVXS1l0TXg1TExNZHE5a3pEUEM4SUxUZlBhWEhZWUwwd2FnS3RIVC96cE5lZEpKREdYcUFMSzZZaGlCM0V6RzF1eUE3aVJNaURZTFp5L0tKeVZ4dmRjOTcxN09WbVlKVXhOOGtIY2M9&s=e89c8d0358244e237e0e18df6b3fe872c1c1cd11&n=1298829005&a=0

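Explanation

Always check what happens when driver.get() runs; sometimes you need to click a box before you can get the page source. Every browser action that a user would perform has to be carried out.

Here we use an XPath selector to find the element to click on that box.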

//button[@class="align-right primary slidedown-button"]

This means

// - search the entire DOM, at any depth
button - the HTML tag we want
[@class="..."] - only tags whose class attribute is exactly the given string
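To see how those three parts combine, here is a minimal sketch that runs the same kind of expression against a small, made-up snippet. The markup and the names SAMPLE and XPATH are assumptions for illustration only and are not taken from the real page. It uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath, so the expression starts with .// rather than //.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for illustration only; not copied from the real page.
SAMPLE = """
<html><body>
  <div>
    <button class="align-right primary slidedown-button">Allow</button>
    <button class="primary">Other</button>
  </div>
</body></html>
"""

root = ET.fromstring(SAMPLE)

# .//button        -> any <button> below the root, at any depth
# [@class='...']   -> keep only those whose class attribute equals this exact string
XPATH = ".//button[@class='align-right primary slidedown-button']"

matches = root.findall(XPATH)
matched_text = [button.text for button in matches]
print(matched_text)
```

Only the first button is matched. The second one also carries the class primary, but its class attribute is not the full string, so an exact @class comparison skips it.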

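I usually wait a little before accessing an element. This page takes a while to load, and you will often need to add a wait before you can get the element or part of the page you want.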

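There are several ways to do this; using the time module as above is the quick and dirty one. Selenium also provides specific methods that wait for an element to appear. I actually had a go at that and could not get it to work.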

See the Selenium documentation on waits for the specific parts worth knowing.
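As a rough illustration of what waiting for an element means, compared with a fixed time.sleep(5), here is a minimal polling sketch in plain Python. The helper wait_until and its parameters are hypothetical names invented for this example; it is not Selenium's implementation, only the general idea of retrying a check until it succeeds or a timeout expires.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for "is the button there yet?": succeeds on the third check.
attempts = []


def button_ready():
    attempts.append(1)
    return "button" if len(attempts) >= 3 else None


found = wait_until(button_ready, timeout=2.0, poll_interval=0.01)
print(found, len(attempts))
```

Selenium ships its own version of this pattern: WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, '//button[@class="align-right primary slidedown-button"]'))), where EC is selenium.webdriver.support.expected_conditions. It returns the element as soon as it is clickable instead of always pausing for the full five seconds.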

If you look at the HTML, you will see that the link sits inside a span element with class rh_button_wrapper, not a div.
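The effect of searching the wrong tag can be reproduced on a small sample. This is only a sketch: the markup below is made up, and it uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra installs. WrapperLinkParser and find_links are hypothetical helpers written for this example.

```python
from html.parser import HTMLParser

# Hypothetical markup for illustration; the real page's HTML may differ.
SAMPLE_HTML = """
<div class="priced_block">
  <span class="rh_button_wrapper">
    <a href="https://example.com/out.php?go=abc">GET COUPON CODE</a>
  </span>
</div>
"""


class WrapperLinkParser(HTMLParser):
    """Collect hrefs of <a> tags nested inside <wrapper_tag class="rh_button_wrapper">."""

    def __init__(self, wrapper_tag):
        super().__init__()
        self.wrapper_tag = wrapper_tag
        self.open_wrappers = []  # one bool per open wrapper_tag: does it have the class?
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == self.wrapper_tag:
            classes = (attrs.get("class") or "").split()
            self.open_wrappers.append("rh_button_wrapper" in classes)
        elif tag == "a" and any(self.open_wrappers) and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == self.wrapper_tag and self.open_wrappers:
            self.open_wrappers.pop()


def find_links(html, wrapper_tag):
    parser = WrapperLinkParser(wrapper_tag)
    parser.feed(html)
    return parser.links


div_links = find_links(SAMPLE_HTML, "div")
span_links = find_links(SAMPLE_HTML, "span")
print(div_links)
print(span_links)
```

Searching for a div with class rh_button_wrapper finds nothing because no such div exists in the sample, while searching for the span returns the link. That is the same reason soup.find_all('div', ...) came back empty in the question.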

Answer 1
0 Accepted 2020-08-09 07:04:21