Scraping a dynamically loaded href attribute with Selenium and BeautifulSoup4

Question

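I am trying to scrape a dynamically loaded href attribute using Selenium and BeautifulSoup4.

When I view the page source, the href attribute is empty, but when I use Inspect Element, the href attribute contains a link. That means the href attribute is loaded dynamically. How can I extract that link?

I am trying the following code: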

def Scrape_Udemy():
    driver.get('https://couponscorpion.com/marketing/complete-guide-to-pinterest-pinterest-growth-2020/')
    content = driver.page_source
    soup = BeautifulSoup(content, 'html.parser')
    course_link = soup.find_all('div',{'class':"rh_button_wrapper"})
    for i in course_link:
        link = i.find('a',href=True)
        if link is None:
           print('No Links Found')
        print(link['href'])

But when I run the function, it prints []. I am using the Chrome driver. How can I fix this? I want to get the free coupon code link from the URL https://couponscorpion.com/marketing/complete-guide-to-pinterest-pinterest-growth-2020/

Answer 1

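Two things:

Before getting the page source, there is an "allow" box that needs to be clicked.
Your link is a direct child of a span, not of a div.

Code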

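import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

# executable_path is the Selenium 3 form; newer Selenium 4 releases take a Service object instead.
driver = webdriver.Chrome(executable_path=r'c:\users\aaron\chromedriver.exe')
driver.get('https://couponscorpion.com/marketing/complete-guide-to-pinterest-pinterest-growth-2020/')
# Crude wait for the page and the box to finish loading.
time.sleep(5)
# Click the button on the box before reading the page source.
driver.find_element(By.XPATH, '//button[@class="align-right primary slidedown-button"]').click()
content = driver.page_source
soup = BeautifulSoup(content, 'html.parser')
# The link sits inside a span, not a div.
course_link = soup.find_all('span', {'class': "rh_button_wrapper"})
for i in course_link:
    link = i.find('a', href=True)
    if link is None:
        print('No Links Found')
        continue  # skip this wrapper instead of indexing None below
    print(link['href'])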

Output

https://couponscorpion.com/scripts/udemy/out.php?go=Q25aTzVXS1l0TXg1TExNZHE5a3pEUEM4SUxUZlBhWEhZWUwwd2FnS3RIVC96cE5lZEpKREdYcUFMSzZZaGlCM0V6RzF1eUE3aVJNaURZTFp5L0tKeVZ4dmRjOTcxN09WbVlKVXhOOGtIY2M9&s=e89c8d0358244e237e0e18df6b3fe872c1c1cd11&n=1298829005&a=0

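Explanation

Always check what happens when driver.get() executes. Sometimes there are boxes that need to be clicked before you can get the page source. Every action a user would have to perform in the browser has to be performed by the script as well.

Here we use an XPath selector to find the element to click on that box.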

//button[@class="align-right primary slidedown-button"]

This means:

// - Search the entire DOM
button - The HTML tag we want
[@class=""] - Only tags whose class attribute is exactly the quoted string
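The exact-match behaviour of [@class="..."] can be tried without a browser. The following is only a sketch: it uses Python's standard xml.etree.ElementTree, which supports a limited subset of XPath, and the markup in PAGE is a hypothetical stand-in, not the real page. In ElementTree the search has to start with .// relative to the root element.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the box markup (not the real page).
PAGE = """
<html>
  <body>
    <div id="box">
      <button class="align-right secondary slidedown-button">Later</button>
      <button class="align-right primary slidedown-button">Allow</button>
    </div>
  </body>
</html>
""".strip()

root = ET.fromstring(PAGE)

# './/' searches all descendants; [@class='...'] compares the whole attribute string.
exact = root.findall(".//button[@class='align-right primary slidedown-button']")

# A single class name does not match, because the comparison is not per-token.
partial = root.findall(".//button[@class='primary']")

print([b.text for b in exact])  # ['Allow']
print(len(partial))             # 0
```

Because the comparison covers the whole attribute value, the selector stops matching if the site adds, removes, or reorders a class name on that button.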

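I usually wait for a while before accessing elements. This page takes some time to load, and you will often need to add a wait before the element or part of the page you want is available.

There are several ways to do this. Using the time module is the quick and dirty way. Selenium can also wait for an element to appear through specific methods. I actually had a go at that and could not get it to work.

See the documentation here, and here for the specific section worth knowing about.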
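The idea behind waiting for an element, rather than sleeping for a fixed time, is to poll a condition until it succeeds or a timeout expires. Selenium ships this as WebDriverWait together with expected_conditions. The sketch below is not Selenium's implementation: wait_until and fake_lookup are hypothetical names, and the "element" is simulated so the example runs without a browser.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


# Stand-in for a page element that only "appears" on the third check.
attempts = {"count": 0}


def fake_lookup():
    attempts["count"] += 1
    return "button" if attempts["count"] >= 3 else None


found = wait_until(fake_lookup, timeout=2.0, poll=0.01)
print(found, attempts["count"])  # button 3
```

With a real driver, the condition could be a lambda that calls driver.find_elements(By.XPATH, ...), which returns an empty list until the element exists. In practice, WebDriverWait(driver, 10).until(...) with expected_conditions.element_to_be_clickable((By.XPATH, ...)) covers the same ground and also checks that the element is visible and enabled.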

If you look at the HTML, you will see that the link sits inside a span element with class rh_button_wrapper, not a div.

Answer 1
0 | Accepted | 2020-08-09 07:04:21
