Get element text with a partial string match using Selenium (Python)

Question

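I am trying to extract the text from a <strong> tag that is deeply nested in the HTML of this page: https://www.marinetraffic.com/en/ais/details/ships/imo:9854612

For example:

The strong tag is the only tag on the page that contains the string "cubic meters".

My goal is to extract the whole text, i.e. "138124 cubic meters Liquid Gas". When I try the following, I get an error: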

url = "https://www.marinetraffic.com/en/ais/details/ships/imo:9854612"
driver.get(url)
time.sleep(3)
element = driver.find_element_by_link_text("//strong[contains(text(),'cubic meters')]").text
print(element)

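Error:

NoSuchElementException: Message: no such element: Unable to locate element: {"method":"link text","selector":"//strong[contains(text(),'cubic meters')]"}

What am I doing wrong here?

The following also raises an error: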

element = driver.find_element_by_xpath("//strong[contains(text(),'cubic')]").text

Answer 1

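You can use Beautiful Soup for this, more precisely its string argument; from the documentation, "you can search for strings instead of tags".

As the argument, you can also pass a regular expression pattern.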

>>> from bs4 import BeautifulSoup
>>> import re
>>> soup = BeautifulSoup(driver.page_source, "html.parser")
>>> soup.find_all(string=re.compile(r"\d+ cubic meters"))
['173400 cubic meters Liquid Gas']

If you are sure there is only one result, or you only need the first one, you can also use find instead of find_all.
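To see why a partial pattern returns the full text: Beautiful Soup filters strings against a compiled pattern with its search() method, so the pattern only has to match somewhere inside the text node. The sketch below imitates that behaviour with the standard library's html.parser rather than Beautiful Soup itself. TextNodeFinder and sample_html are made-up names for illustration, and the markup is a stand-in for driver.page_source, not the real page.

```python
import re
from html.parser import HTMLParser


class TextNodeFinder(HTMLParser):
    """Collect text nodes whose content matches a compiled regex."""

    def __init__(self, pattern):
        super().__init__()
        self.pattern = pattern
        self.matches = []

    def handle_data(self, data):
        # search() looks for the pattern anywhere in the node, so a
        # partial match is enough and the whole text node is kept
        if self.pattern.search(data):
            self.matches.append(data.strip())


# hypothetical markup standing in for driver.page_source
sample_html = """
<div><span>Capacity</span>
  <strong>173400 cubic meters Liquid Gas</strong>
  <strong>Length 297 m</strong>
</div>
"""

finder = TextNodeFinder(re.compile(r"\d+ cubic meters"))
finder.feed(sample_html)
finder.close()
print(finder.matches)  # ['173400 cubic meters Liquid Gas']
```

Taking finder.matches[0] here corresponds to using find instead of find_all.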

Answer 2

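Your code works with Firefox() but not with Chrome().

The page uses lazy loading, so you have to scroll down to Summary, and only then does it load the text with the expected strong.

I used a slightly slower method: I search for all elements with class='lazyload-wrapper', then in a loop I scroll to each item and check whether there is a strong. If there isn't one, I scroll to the next class='lazyload-wrapper' element.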

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
import time

#driver = webdriver.Firefox()
driver = webdriver.Chrome()

url = "https://www.marinetraffic.com/en/ais/details/ships/imo:9854612"
driver.get(url)
time.sleep(3)

actions = ActionChains(driver)
# find_element(s)_by_* was removed in Selenium 4.3; use By locators instead
elements = driver.find_elements(By.XPATH, "//span[@class='lazyload-wrapper']")

for number, item in enumerate(elements):
    print('--- item', number, '---')
    #print('--- before ---')
    #print(item.text)

    # scroll to the wrapper so that its lazy-loaded content gets rendered
    actions.move_to_element(item).perform()
    time.sleep(0.1)

    #print('--- after ---')
    #print(item.text)

    try:
        # note: an XPath starting with `//` searches the whole document,
        # not only inside `item` (that would be `.//strong[...]`)
        strong = item.find_element(By.XPATH, "//strong[contains(text(), 'cubic')]")
        print(strong.text)
        break
    except Exception as ex:
        #print(ex)
        pass

Result:

--- item 0 ---
--- item 1 ---
--- item 2 ---
173400 cubic meters Liquid Gas

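The result shows that I could use elements[2] to skip the first two elements, but I'm not sure this text will always be in the third element.

Before I created my version I tested the other versions; here is the full working code: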

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
import time

#driver = webdriver.Firefox()
driver = webdriver.Chrome()

url = "https://www.marinetraffic.com/en/ais/details/ships/imo:9854612"
driver.get(url)
time.sleep(3)

def test0():
    # print every <strong>, then look for the one containing 'cubic'
    elements = driver.find_elements(By.XPATH, "//strong")
    for item in elements:
        print(item.text)

    print('---')

    item = driver.find_element(By.XPATH, "//strong[contains(text(), 'cubic')]")
    print(item.text)

def test1a():
    # move to the element before reading its text
    actions = ActionChains(driver)
    element = driver.find_element(By.XPATH, "//div[contains(@class,'MuiTypography-body1')][last()]//div")
    # Python's ActionChains has no build(); perform() is enough
    actions.move_to_element(element).perform()
    text = element.text
    print(text)

def test1b():
    # scroll to the bottom of the page, then read the <strong>
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(0.5)
    text = driver.find_element(By.XPATH, "//div[contains(@class,'MuiTypography-body1')][last()]//strong").text
    print(text)

def test2():
    # search the page source with Beautiful Soup
    from bs4 import BeautifulSoup
    import re
    soup = BeautifulSoup(driver.page_source, "html.parser")
    print(soup.find_all(string=re.compile(r"\d+ cubic meters")))

def test3():
    # scroll to every lazy-load wrapper until the <strong> shows up
    actions = ActionChains(driver)
    elements = driver.find_elements(By.XPATH, "//span[@class='lazyload-wrapper']")

    for number, item in enumerate(elements, 1):
        print('--- number', number, '---')
        #print('--- before ---')
        #print(item.text)

        actions.move_to_element(item).perform()
        time.sleep(0.1)

        #print('--- after ---')
        #print(item.text)

        try:
            # `//` searches the whole document, not only inside `item`
            strong = item.find_element(By.XPATH, "//strong[contains(text(), 'cubic')]")
            print(strong.text)
            break
        except Exception as ex:
            #print(ex)
            pass

#test0()
#test1a()
#test1b()
#test2()
test3()

Answer 3

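Your XPath expression is correct and works in Chrome. You get NoSuchElementException because the element has not loaded within the 3 seconds you wait and does not exist yet.

To wait for the element, use the WebDriverWait class. It explicitly waits for a specific condition on the element, and in your case presence is enough.

In the code below, Selenium will wait up to 10 seconds for the element to be present in the HTML, polling every 500 milliseconds. You can read more about WebDriverWait and the expected conditions in the Selenium documentation.

Some useful information:
Elements that are not visible return an empty string. In that case, wait for the element to become visible, or, if it has to be scrolled into view, scroll to it (an example is included).

You can also use JavaScript to get the text from an element that is not visible.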

from selenium.webdriver.common.by import By
from selenium.webdriver.remote.webelement import WebElement
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec
from selenium import webdriver

url = "https://www.marinetraffic.com/en/ais/details/ships/imo:9854612"
locator = "//strong[contains(text(),'cubic meters')]"

with webdriver.Chrome() as driver:  # Type: webdriver
    wait = WebDriverWait(driver, 10)

    driver.get(url)

    cubic = wait.until(ec.presence_of_element_located((By.XPATH, locator)))  # Type: WebElement
    print(cubic.text)

    # The below examples are just for information
    # and are not needed for the case

    # Example with scroll. Scroll to the element to make it visible
    cubic.location_once_scrolled_into_view
    print(cubic.text)

    # Example using JavaScript. Works for not visible elements.
    text = driver.execute_script("return arguments[0].textContent", cubic)
    print(text)
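To make the polling behaviour concrete, here is a rough, standard-library-only sketch of what an explicit wait does: call a condition repeatedly, treat "not found yet" as a reason to retry, and give up after a timeout. It is not Selenium's implementation. wait_until, find_text and the attempts counter are hypothetical names, and LookupError stands in for NoSuchElementException.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition()
            if result:
                return result
        except LookupError:
            # stand-in for NoSuchElementException: not there yet, retry
            pass
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)


# hypothetical page whose text only shows up on the third lookup
attempts = {"count": 0}


def find_text():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise LookupError("element not loaded yet")
    return "173400 cubic meters Liquid Gas"


text = wait_until(find_text, timeout=2.0, poll=0.05)
print(text, "after", attempts["count"], "attempts")
```

Unlike a fixed time.sleep(3), the loop returns as soon as the condition holds and only fails once the whole timeout has passed.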

Using the marinetraffic API is the right way to do this.

Answer 4

I guess you should first scroll to that element and only then try to access it, including getting its text.

from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By

actions = ActionChains(driver)
element = driver.find_element(By.XPATH, "//div[contains(@class,'MuiTypography-body1')][last()]//div")
# Python's ActionChains has no build(); perform() is enough
actions.move_to_element(element).perform()
text = element.text

If the above is still not good enough, you can scroll the full page height once, like this:

import time
from selenium.webdriver.common.by import By

driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(0.5)
the_text = driver.find_element(By.XPATH, "//div[contains(@class,'MuiTypography-body1')][last()]//strong").text

4 answers

Answer 1
1 2021-06-23 21:04:58

Answer 2
1 Accepted 2021-06-23 21:42:10

Answer 3
1 2021-06-23 22:10:26

Answer 4
0 2021-06-23 21:04:50
