BeautifulSoup, Selenium and Python: parsing by tag

Question

I am trying to parse data from this website:

https://findrulesoforigin.org/home/compare?reporter=392&partner=036&product=020130010

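Specifically, I am trying to get the data under Criterion (ITC). The text I want says CC + ECT.

The information I want appears in the HTML as

<a class="js-glossary" data-leg="CC+ECT">

I am new to web scraping. I tried the techniques taught in tutorials, but they did not work. I had heard of Selenium and tried that as well, but the following code does not work either.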

from selenium import webdriver
from bs4 import BeautifulSoup
import requests

driver = webdriver.Firefox(executable_path = r"D:\Python work\driver\geckodriver.exe")
driver.get(r"https://findrulesoforigin.org/home/compare?reporter=392&partner=036&product=020130010")
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
data = soup.find_all("a", attrs= {"class":"js-glossary"})

The code returns an empty list. I have also read that I can extract data by treating a soup tag like a dictionary. In this case:

data["data-leg"]

Am I on the right track, or am I way off?

Answer 1

The text you are trying to get is generated dynamically by JavaScript. To get it, you need to wait for it to appear:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox(executable_path = r"D:\Python work\driver\geckodriver.exe")
driver.get(r"https://findrulesoforigin.org/home/compare?reporter=392&partner=036&product=020130010")
text = WebDriverWait(driver, 5).until(lambda driver: driver.find_element_by_xpath('//div[.="criterion(itc)"]/following-sibling::div').text)
print(text)
#  'CC + ECT'
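
To picture what the wait is doing: `WebDriverWait(...).until(...)` keeps calling the function it is given until that function returns something truthy or the timeout runs out. Below is a minimal, standard-library-only sketch of that polling idea. It is not Selenium's implementation; `wait_until`, `read_criterion` and `fake_page` are hypothetical names made up for the illustration, and the "page" is just a dict that gets filled in after a short delay.

```python
import time


def wait_until(condition, timeout=5.0, poll_interval=0.1):
    """Call condition() repeatedly until it returns a truthy value.

    Raises TimeoutError if no truthy value is returned within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition()
            if result:
                return result
        except LookupError:
            # The "element" does not exist yet, so keep polling.
            pass
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll_interval)


# Hypothetical stand-in for a page whose content is filled in by JavaScript
# roughly 0.3 seconds after loading.
fake_page = {}
ready_at = time.monotonic() + 0.3


def read_criterion():
    # Stand-in for "find the element and read its text".
    if time.monotonic() >= ready_at:
        fake_page["criterion"] = "CC + ECT"
    return fake_page["criterion"]  # raises KeyError until the content exists


text = wait_until(read_criterion, timeout=2.0)
print(text)
```

Reading the page source immediately after `driver.get(...)`, as in the question, corresponds to calling `read_criterion()` once before the content exists, which is why nothing is found.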

Answer 2

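It looks like you were very close. If you are using Selenium, you may not even need Beautiful Soup. With Selenium you need to introduce WebDriverWait so that the desired element becomes visible before you read it. You can use the following solution:

Code block: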

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox(executable_path = r'C:\\Utility\\BrowserDrivers\\geckodriver.exe')
driver.get(r"https://findrulesoforigin.org/home/compare?reporter=392&partner=036&product=020130010")
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@class='lbl' and text()='criterion(itc)']//following::div[1]/a"))).get_attribute("innerHTML"))

Console output:
```
  CC + ECT 
```
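
On the question's point about treating a tag like a dictionary: that works on a single tag, but `find_all` returns a list of tags, so `data["data-leg"]` on the list itself would fail even once the page has loaded. Here is a minimal sketch, assuming the rendered markup looks like the snippet quoted in the question. The `html` string is a hypothetical static stand-in, not the real page source, and the built-in `html.parser` backend is used so that `lxml` is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical static snippet modelled on the markup quoted in the question.
html = '<div><a class="js-glossary" data-leg="CC+ECT">CC + ECT</a></div>'

soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a", attrs={"class": "js-glossary"})

# find_all returns a list of Tag objects, so index each Tag, not the list.
codes = [link["data-leg"] for link in links]
texts = [link.get_text(strip=True) for link in links]

# .get() returns None instead of raising KeyError when an attribute is absent.
missing = links[0].get("data-missing")

print(codes)
print(texts)
print(missing)
```

With Selenium, the same parsing could be applied to `driver.page_source`, provided it is read after the wait has succeeded.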

Answer 1: 1, accepted, 2018-12-19 09:54:23

Answer 2: 1, 2018-12-19 10:25:03