
Scraping data from multiple tooltips using Python and Selenium

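I'm trying to scrape hydrogen sulfide data from this website using Python and Selenium. What I've been struggling with so far is that I don't know how to get the data for each tooltip (site ID, site name, date, value, unit, etc.). As you can see, there are seven monitoring points, A through G, each with its own data. I've done a lot of research but am still stuck. I put together the following code to scrape the data for a particular date, but it runs into errors. Please see my code below.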

from selenium import webdriver
from webdriver_manager.microsoft import EdgeChromiumDriverManager
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Edge(EdgeChromiumDriverManager(log_level=20).install())
driver.maximize_window()
driver.get("https://marathonlosangelesrefineryfencelinemonitoring.com/index.html")

# Navigate to monitors
button = driver.find_element_by_xpath("//div[@class='nav-link-text']")   
button.click()

# Navigate to dropdown button
dropdown = driver.find_element_by_xpath("//i[@class='arrow-down parameter-arrow']") 
dropdown.click()

# Select Hydrogen Sulfide and click
h2s = driver.find_element_by_xpath("//ul[@class='dropdown-menu' and @role='menu' and @aria-labelledby='ParameterDropdown']//li[12]")
h2s.click()

res = []
test = driver.find_elements_by_xpath("//div[@class='leaflet-pane leaflet-marker-pane']//div[contains(@class, 'leaflet-marker-icon')]")
for ele in test:
    hover = ActionChains(driver).move_to_element(ele)
    hover.perform()
    try:
        site_id = driver.find_element_by_css_selector(".LAR-tooltip-site-id > p")
        site_name = driver.find_element_by_css_selector(".LAR-tooltip-site-name")
        date = driver.find_element_by_css_selector(".LAR-tooltip-localtime")
        value = driver.find_element_by_css_selector(".LAR-tooltip-data-value")
        unit = driver.find_element_by_css_selector(".LAR-tooltip-data-unit")
        para_mdl = driver.find_element_by_css_selector(".tooltip-parameter-mdl")
        res.append((site_id.text, site_name.text, date.text, value.text, unit.text, para_mdl.text))
    except:
        pass

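I'd really appreciate it if anyone could help me solve this. I'd also like to use the code above to scrape data over a time window (say, from August 1, 2021 to January 1, 2022), so any feedback is much appreciated.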

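It looks like all your code needs is a few WebDriverWaits. If I'm not mistaken, React-based websites are somewhat harder to automate because of the many asynchronous updates and the virtual DOM. I've refactored your code with WebDriverWaits where needed (and also collapsed several statements into single lines, though you can keep them separate if you prefer the readability). Here is the code: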

from selenium import webdriver
from webdriver_manager.microsoft import EdgeChromiumDriverManager
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

driver = webdriver.Edge(EdgeChromiumDriverManager(log_level=20).install())
driver.maximize_window()
driver.get("https://marathonlosangelesrefineryfencelinemonitoring.com/index.html")
# Navigate to monitors
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//div[@class='nav-link-text']"))).click()
# Navigate to dropdown button
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//i[@class='arrow-down parameter-arrow']"))).click()
# Select Hydrogen Sulfide and click
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//ul[@class='dropdown-menu' and @role='menu' and @aria-labelledby='ParameterDropdown']//li[12]"))).click()
# Wait until the map markers are visible
WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='leaflet-pane leaflet-marker-pane']//div[contains(@class, 'leaflet-marker-icon')]")))
driver.find_element_by_css_selector(".arrow-down.date-arrow").click()
req_month = 'Aug'
req_year = '2021'
req_timeline = req_month + " " + req_year
print(f"Timeline Selected is: {req_timeline}")
for i in range(11):
    month = driver.find_element(By.XPATH, "//th[@class='month']").text
    if month == req_timeline:
        break
    else:
        driver.find_element(By.XPATH, "//th[@class='prev available']").click()
driver.find_element(By.XPATH, "//*[@class='table-condensed']//td[text()='1']").click()
driver.find_element(By.XPATH, "//*[text()='Apply']").click()
time.sleep(8)
res = []
test = driver.find_elements_by_xpath("//div[@class='leaflet-pane leaflet-marker-pane']//div[contains(@class, 'leaflet-marker-icon')]")
for ele in test:
    hover = ActionChains(driver).move_to_element(ele)
    hover.perform()
    time.sleep(1)
    try:
        site_id = driver.find_element_by_css_selector(".LAR-tooltip-site-id > p")
        site_name = driver.find_element_by_css_selector(".LAR-tooltip-site-name")
        date = driver.find_element_by_css_selector(".LAR-tooltip-localtime")
        value = driver.find_element_by_css_selector(".LAR-tooltip-data-value")
        unit = driver.find_element_by_css_selector(".LAR-tooltip-data-unit")
        para_mdl = driver.find_element_by_css_selector(".tooltip-parameter-mdl")
        res.append((site_id.text, site_name.text, date.text, value.text, unit.text, para_mdl.text))
    except:
        pass
print(res)

Here is the result:

Timeline Selected is: Aug 2021
[('F', 'Point Monitor', '7:55 AM', '1.80', 'ppb', 'MDL: 0.40 ppb'), ('B', 'Point Monitor', '7:55 AM', '1.20', 'ppb', 'MDL: 0.40 ppb'), ('E', 'Point Monitor', '7:55 AM', '1.10', 'ppb', 'MDL: 0.40 ppb'), ('A', 'Point Monitor', '7:55 AM', '0.40', 'ppb', 'MDL: 0.40 ppb')]

Process finished with exit code 0

As you can see, even with WebDriverWaits in place, some spots still need a hard pause with time.sleep; otherwise the script becomes flaky.
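As a rough illustration of the alternative to fixed pauses, the sketch below polls a condition until it holds or a timeout expires. The name `poll_until` and the stand-in `readings` iterator are hypothetical and not part of the answer's code; Selenium's own `WebDriverWait(driver, timeout).until(fn)` follows the same pattern, with `fn` receiving the driver.

```python
import time


def poll_until(condition, timeout=10.0, interval=0.25):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(interval)


# Hypothetical stand-in for "read the tooltip's site ID after hovering":
# the first two reads come back empty, the third has the value.
readings = iter(["", "", "F"])
site_id = poll_until(lambda: next(readings), timeout=2.0, interval=0.01)
print(site_id)
```

Whether this removes every sleep on this particular site is untested here; it only shows the shape of a wait that ends as soon as the data appears instead of after a fixed delay.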

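@ThaiNguyen, I'm adding another answer so the earlier one is preserved. I tried a somewhat crude approach and, after many attempts, got it to work, but take it with a grain of salt because I only iterated over 3 dates in August. The refactored code is pasted below, but before you look at it, let me explain the problems I ran into so you can keep them in mind.

To let the DOM settle after each action I had to add a lot of sleeps (and as you know, time.sleep is quite unreliable on asynchronous pages). Even with the waits I still saw the code fail with stale element errors; increasing the sleep times took care of them, for now.

The other thing, and in my view it is the big one: even though this code fetches results successfully, I can't guarantee it will do so for every date in August (let alone every month you need), because it behaves very inconsistently against the rendered DOM. I don't want to blame the code at this point (my knowledge of Selenium is limited), but if I'm not mistaken the DOM is heavily asynchronous.

So what I'm saying is: don't expect this code to get everything in one go. You will probably have to either spend time refactoring and hardening it, or fetch the data in chunks by running it several times over a few dates of each month, which is quite frustrating given how fragile it is.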

from selenium import webdriver
from webdriver_manager.microsoft import EdgeChromiumDriverManager
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

driver = webdriver.Edge(EdgeChromiumDriverManager(log_level=20).install())
driver.maximize_window()
def h2s_selection():
    driver.get("https://marathonlosangelesrefineryfencelinemonitoring.com/index.html")
    # Navigate to monitors
    WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//div[@class='nav-link-text']"))).click()
    # Navigate to dropdown button
    WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//i[@class='arrow-down parameter-arrow']"))).click()
    # Select Hydrogen Sulfide and click
    WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//ul[@class='dropdown-menu' and @role='menu' and @aria-labelledby='ParameterDropdown']//li[12]"))).click()
    # Wait until the map markers are visible
    WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='leaflet-pane leaflet-marker-pane']//div[contains(@class, 'leaflet-marker-icon')]")))

def aug_date():
    driver.find_element_by_css_selector(".arrow-down.date-arrow").click()
    req_month = 'Aug'
    req_year = '2021'
    req_timeline = req_month + " " + req_year
    print(f"Timeline Selected is: {req_timeline}")
    for i in range(11):
        month = driver.find_element(By.XPATH, "//th[@class='month']").text
        if month == req_timeline:
            break
        else:
            driver.find_element(By.XPATH, "//th[@class='prev available']").click()
    dt = ['1', '2', '3']
    for i in dt:
        time.sleep(5)
        each_date = driver.find_element(By.XPATH, "//*[@class='table-condensed']//td[text()=" + i + ']')
        print(f"Date is {each_date.text}")
        each_date.click()
        driver.find_element(By.XPATH, "//*[text()='Apply']").click()
        time.sleep(10)
        tooltips()
        time.sleep(5)
        driver.find_element_by_css_selector(".arrow-down.date-arrow").click()

def tooltips():
    # time.sleep(8)
    res = []
    test = driver.find_elements_by_xpath("//div[@class='leaflet-pane leaflet-marker-pane']//div[contains(@class, 'leaflet-marker-icon')]")
    for ele in test:
        hover = ActionChains(driver).move_to_element(ele)
        hover.perform()
        time.sleep(1)
        try:
            site_id = driver.find_element_by_css_selector(".LAR-tooltip-site-id > p")
            site_name = driver.find_element_by_css_selector(".LAR-tooltip-site-name")
            date = driver.find_element_by_css_selector(".LAR-tooltip-localtime")
            value = driver.find_element_by_css_selector(".LAR-tooltip-data-value")
            unit = driver.find_element_by_css_selector(".LAR-tooltip-data-unit")
            para_mdl = driver.find_element_by_css_selector(".tooltip-parameter-mdl")
            res.append((site_id.text, site_name.text, date.text, value.text, unit.text, para_mdl.text))
        except:
            pass
    print(res)


if __name__ == "__main__":
    h2s_selection()
    aug_date()

Output:

Timeline Selected is: Aug 2021
Date is 1
[('F', 'Point Monitor', '10:55 AM', '0.90', 'ppb', 'MDL: 0.40 ppb'), ('B', 'Point Monitor', '10:55 AM', '1.20', 'ppb', 'MDL: 0.40 ppb'), ('E', 'Point Monitor', '10:55 AM', '1.30', 'ppb', 'MDL: 0.40 ppb'), ('A', 'Point Monitor', '10:55 AM', '0.60', 'ppb', 'MDL: 0.40 ppb')]
Date is 2
[('B', 'Point Monitor', '10:25 PM', '1.70', 'ppb', 'MDL: 0.40 ppb'), ('E', 'Point Monitor', '10:25 PM', '1.90', 'ppb', 'MDL: 0.40 ppb')]
Date is 3
[('F', 'Point Monitor', '9:55 AM', '1.20', 'ppb', 'MDL: 0.40 ppb'), ('B', 'Point Monitor', '9:55 AM', '1.20', 'ppb', 'MDL: 0.40 ppb'), ('E', 'Point Monitor', '9:55 AM', '1.90', 'ppb', 'MDL: 0.40 ppb'), ('A', 'Point Monitor', '9:55 AM', '0.50', 'ppb', 'MDL: 0.40 ppb')]

Process finished with exit code 0
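The chunked approach suggested in this answer could be planned up front with a small helper. The following is only a sketch: `plan_dates` and `chunk_size` are hypothetical names, and it assumes the date picker's month caption has the form "Aug 2021", as compared against `req_timeline` in the code above.

```python
from datetime import date, timedelta
from itertools import groupby

# Fixed English abbreviations, so the caption does not depend on the locale.
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def plan_dates(start, end, chunk_size=3):
    """Yield (caption, days) pairs covering start..end inclusive.

    `caption` is the month caption to look for in the picker (e.g. "Aug 2021");
    `days` is a list of at most `chunk_size` day numbers as strings.
    """
    days = [start + timedelta(days=n) for n in range((end - start).days + 1)]
    for (year, month), group in groupby(days, key=lambda d: (d.year, d.month)):
        group = list(group)
        caption = f"{MONTHS[month - 1]} {year}"
        for i in range(0, len(group), chunk_size):
            yield caption, [str(d.day) for d in group[i:i + chunk_size]]


plan = list(plan_dates(date(2021, 8, 1), date(2021, 8, 7)))
print(plan)
# [('Aug 2021', ['1', '2', '3']), ('Aug 2021', ['4', '5', '6']), ('Aug 2021', ['7'])]
```

Each pair could then drive one run of the month navigation and the date loop, so a failure only costs one small chunk rather than the whole window.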

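Automating a React-based website is somewhat harder than automating a traditional one. That said, I went through your code in detail and tried to reproduce it. I found that in your loop you assign ele as the thing to hover over, but ele itself is an index rather than an element from the list of elements, so the webdriver is trying to find an element using ele, which is only an index, and therefore fails. That's my reading, at least. I've tweaked your code slightly and got results. Please check whether this is what you wanted.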

res = []
leaflet_pane = "//div[@class='leaflet-pane leaflet-marker-pane']//div[contains(@class, 'leaflet-marker-icon')]"
test = driver.find_elements_by_xpath(leaflet_pane)
for ele in range(len(test)):
    each_test = driver.find_element_by_xpath('(' + leaflet_pane + ')' + '[' + str(ele+1) + ']')
    hover = ActionChains(driver).move_to_element(each_test)
    hover.perform()
    time.sleep(1)
    try:
        site_id = driver.find_element_by_css_selector(".LAR-tooltip-site-id")
        site_name = driver.find_element_by_css_selector(".LAR-tooltip-site-name")
        date = driver.find_element_by_css_selector(".LAR-tooltip-localtime")
        value = driver.find_element_by_css_selector(".LAR-tooltip-data-value")
        unit = driver.find_element_by_css_selector(".LAR-tooltip-data-unit")
        para_mdl = driver.find_element_by_css_selector(".tooltip-parameter-mdl")
        res.append((site_id.text, site_name.text, date.text, value.text, unit.text, para_mdl.text))
    except:
        pass
    time.sleep(1)

Here is the output:

[('G', 'Point Monitor', '10:20 PM', '0.90', 'ppb', 'MDL: 0.40 ppb'), ('F', 'Point Monitor', '10:20 PM', '0.40', 'ppb', 'MDL: 0.40 ppb'), ('B', 'Point Monitor', '10:20 PM', '1.00', 'ppb', 'MDL: 0.40 ppb'), ('C', 'Point Monitor', '10:20 PM', '1.00', 'ppb', 'MDL: 0.40 ppb'), ('E', 'Point Monitor', '10:20 PM', '0.90', 'ppb', 'MDL: 0.40 ppb')]
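Since the tooltip only carries a time of day, it may help to store each tuple together with the date that was selected in the picker. A minimal sketch, assuming the six fields are in the order used by `res.append(...)` above; the column names, the `rows_to_csv` helper, and the date label in the example are all hypothetical.

```python
import csv
import io

# Column names are illustrative; the order follows res.append(...) above.
FIELDS = ("site_id", "site_name", "local_time", "value", "unit", "mdl")


def rows_to_csv(rows, scrape_date):
    """Serialise tooltip tuples to CSV text, prefixing each row with a date."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(("date",) + FIELDS)
    for row in rows:
        writer.writerow((scrape_date,) + tuple(row))
    return buffer.getvalue()


res = [
    ("G", "Point Monitor", "10:20 PM", "0.90", "ppb", "MDL: 0.40 ppb"),
    ("F", "Point Monitor", "10:20 PM", "0.40", "ppb", "MDL: 0.40 ppb"),
]
# "2021-08-01" is a placeholder label; pass whichever date was actually selected.
print(rows_to_csv(res, scrape_date="2021-08-01"))
```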

I must commend you for writing decent code for a website that is quite hard to scrape.

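@AnandGautam, I realized that whenever I try to scrape a full month of data (say, September 2021), everything goes fine until I reach the 29th, because the same calendar shows both August 29 and September 29. To make the XPath unique, I modified it slightly by adding @data-title=. But I ran into errors. I validated the XPath and found it to be valid, so I still don't know why the error occurs. Please see the code below.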

from selenium import webdriver
from webdriver_manager.microsoft import EdgeChromiumDriverManager
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

driver = webdriver.Edge(EdgeChromiumDriverManager(log_level=20).install())
driver.maximize_window()
def h2s_selection():
    driver.get("https://marathonlosangelesrefineryfencelinemonitoring.com/index.html")
    # Navigate to monitors
    WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//div[@class='nav-link-text']"))).click()
    # Navigate to dropdown button
    WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//i[@class='arrow-down parameter-arrow']"))).click()
    # Select Hydrogen Sulfide and click
    WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//ul[@class='dropdown-menu' and @role='menu' and @aria-labelledby='ParameterDropdown']//li[12]"))).click()
    # Wait until the map markers are visible
    WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='leaflet-pane leaflet-marker-pane']//div[contains(@class, 'leaflet-marker-icon')]")))

def month_data(req_month, req_year, data):
    driver.find_element_by_css_selector(".arrow-down.date-arrow").click()
    req_timeline = req_month + " " + req_year
    print(f"Timeline Selected is: {req_timeline}")
    for i in range(11):
        month = driver.find_element(By.XPATH, "//th[@class='month']").text
        if month == req_timeline:
            break
        else:
            driver.find_element(By.XPATH, "//th[@class='prev available']").click()
    
    for k, v in data.items():
        time.sleep(5)
        each_date = driver.find_element(By.XPATH, f"//*[@class='table-condensed']//td[text()={k} and @data-title={v}]")
        #print(f"Date is {each_date.text}")
        each_date.click()
        driver.find_element(By.XPATH, "//*[text()='Apply']").click()
        time.sleep(10)
        tooltips()
        time.sleep(5)
        driver.find_element_by_css_selector(".arrow-down.date-arrow").click()

def tooltips():
    # time.sleep(8)
    res = []
    test = driver.find_elements_by_xpath("//div[@class='leaflet-pane leaflet-marker-pane']//div[contains(@class, 'leaflet-marker-icon')]")
    for ele in test:
        hover = ActionChains(driver).move_to_element(ele)
        hover.perform()
        time.sleep(1)
        try:
            site_id = driver.find_element_by_css_selector(".LAR-tooltip-site-id > p")
            site_name = driver.find_element_by_css_selector(".LAR-tooltip-site-name")
            date = driver.find_element_by_css_selector(".LAR-tooltip-localtime")
            value = driver.find_element_by_css_selector(".LAR-tooltip-data-value")
            unit = driver.find_element_by_css_selector(".LAR-tooltip-data-unit")
            para_mdl = driver.find_element_by_css_selector(".tooltip-parameter-mdl")
            res.append((site_id.text, site_name.text, date.text, value.text, unit.text, para_mdl.text))
        except:
            pass
    print(res)


if __name__ == "__main__":
    h2s_selection()
    data_dict = {'29': 'r4c3', '30': 'r4c4'}
    month_data(req_month='Sep', req_year='2021', data=data_dict)

I'd really appreciate any pointers or feedback on how to fix this. Thanks!
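One likely cause, offered as a guess rather than a confirmed diagnosis: in `month_data`, the f-string inserts the `data-title` value without quotes, so the expression ends in `@data-title=r4c3`. XPath reads a bare `r4c3` as a child element name rather than a string, which keeps the expression syntactically valid while matching nothing. The sketch below builds the expression with quoted literals; `xpath_literal` and `day_cell_xpath` are hypothetical helper names.

```python
def xpath_literal(value):
    """Quote a Python string as an XPath 1.0 string literal."""
    if "'" not in value:
        return f"'{value}'"
    if '"' not in value:
        return f'"{value}"'
    # Both quote kinds present: stitch the pieces together with concat().
    parts = value.split("'")
    return "concat(" + ", \"'\", ".join(f"'{p}'" for p in parts) + ")"


def day_cell_xpath(day, data_title):
    """XPath for one calendar cell, matched by day text and data-title."""
    return (
        "//*[@class='table-condensed']"
        f"//td[text()={xpath_literal(day)} and @data-title={xpath_literal(data_title)}]"
    )


print(day_cell_xpath("29", "r4c3"))
# //*[@class='table-condensed']//td[text()='29' and @data-title='r4c3']
```

The resulting string can be passed to `driver.find_element(By.XPATH, ...)` in place of the original f-string. Whether `r4c3` is the right cell for September 29 still has to be checked against the rendered calendar.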
