
Get value from a website using Selenium in Python

I am taking my first steps with Selenium in Python and want to extract a certain value from a webpage. The value I need to find on the webpage is the ID (Melde-ID), which is 355460. In the HTML I found the two lines containing my info:

<h3 _ngcontent-wwf-c32="" class="title"> Melde-ID: 355460 </h3><span _ngcontent-wwf-c32="">
<div _ngcontent-wwf-c27="" class="label"> Melde-ID </div><div _ngcontent-wwf-c27="" class="value">

I have been searching for about two hours for which command to use, but I don't know what to actually search for in the HTML. The website is an HTML page with .js modules. Opening the URL with Selenium works.

(At first I tried using BeautifulSoup but was not able to open the page because of some restriction. I did verify that the robots.txt does not disallow anything, but the error with BeautifulSoup was "Unfortunately, a problem occurred while forwarding your request to the backend server".)

I would be thankful for any advice and hope I explained my issue clearly. The code I tried to create in Jupyter Notebook with Selenium installed is as follows:

from selenium import webdriver
import codecs
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options

url = "https://...."
driver = webdriver.Chrome('./chromedriver')
driver.implicitly_wait(0.5)
#maximize browser
driver.maximize_window()
#launch URL
driver.get(url)
#print(driver.page_source)
#Try 2
#print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//a[normalize-space()='Melde-ID']")))])
#close browser
driver.quit()

From the information you shared, we can see that the element containing the desired information does not have a class attribute with the value Melde-ID.
It has the class name title and contains the text Melde-ID.
Also, you should use a WebDriverWait expected condition instead of driver.implicitly_wait(0.5).
With these changes your code can be something like this:

from selenium import webdriver
import codecs
import os
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options

url = "https://...."
driver = webdriver.Chrome('./chromedriver')

wait = WebDriverWait(driver, 20)

#maximize browser
driver.maximize_window()
#launch URL
driver.get(url)

content = wait.until(EC.visibility_of_element_located((By.XPATH, "//*[contains(@class,'title') and contains(.,'Melde-ID:')]"))).text

I added .text to extract the text from that web element.
Now content should contain the value Melde-ID: 355460.
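
Since content still carries the Melde-ID: prefix, you may want to pull out just the number. Below is a minimal sketch using a regular expression; the helper name parse_melde_id and the pattern are my own assumptions for illustration, not part of Selenium, and the sample string is the one from the question.

```python
import re

# Hypothetical pattern: the literal label, optional whitespace, then digits.
_MELDE_ID_RE = re.compile(r"Melde-ID:\s*(\d+)")


def parse_melde_id(text: str) -> int:
    """Return the number that follows 'Melde-ID:' in text, or raise ValueError."""
    match = _MELDE_ID_RE.search(text)
    if match is None:
        raise ValueError(f"no Melde-ID found in {text!r}")
    return int(match.group(1))


# Sample string taken from the question; in the real script you would pass `content`.
print(parse_melde_id("Melde-ID: 355460"))
```

Raising ValueError when nothing matches makes a changed page layout fail loudly instead of silently returning a wrong value.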

Given the HTML:

<h3 _ngcontent-wwf-c32="" class="title"> Melde-ID: 355460 </h3>
<span _ngcontent-wwf-c32="">
    <div _ngcontent-wwf-c27="" class="label"> Melde-ID </div>
    <div _ngcontent-wwf-c27="" class="value">

To extract the text 355460 you need to induce WebDriverWait for visibility_of_element_located(), then split the element's text on the : character and print the second part, using either of the following locator strategies:

  • Using CSS_SELECTOR and the text attribute:

     print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "h3.title"))).text.split(':')[1].strip())
  • Using XPATH and get_attribute("innerHTML"):

     print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//h3[@class='title' and text()]"))).get_attribute("innerHTML").split(':')[1].strip())
  • Note: You have to add the following imports:

     from selenium.webdriver.support.ui import WebDriverWait
     from selenium.webdriver.common.by import By
     from selenium.webdriver.support import expected_conditions as EC
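
The two strategies differ in how much whitespace they hand back: as far as I know, .text returns the rendered text with surrounding whitespace trimmed, while innerHTML returns the raw markup content, which for the h3 above includes the leading and trailing spaces. The sketch below illustrates this without a browser. FakeElement is a hypothetical stand-in for a WebElement and id_after_colon is an illustrative helper; neither is part of Selenium.

```python
class FakeElement:
    """Hypothetical stand-in for a Selenium WebElement, for illustration only."""

    def __init__(self, inner_html: str):
        self._inner_html = inner_html

    @property
    def text(self) -> str:
        # Assumption: mimics WebElement.text trimming surrounding whitespace.
        return self._inner_html.strip()

    def get_attribute(self, name: str) -> str:
        # Only the attribute used in this sketch is supported.
        if name != "innerHTML":
            raise KeyError(name)
        return self._inner_html


def id_after_colon(raw: str) -> str:
    """Return the part after the first ':' with surrounding whitespace removed."""
    _, _, tail = raw.partition(":")
    return tail.strip()


# Inner content of the h3 from the HTML above, including its spaces.
h3 = FakeElement(" Melde-ID: 355460 ")
via_text = id_after_colon(h3.text)
via_inner_html = id_after_colon(h3.get_attribute("innerHTML"))
print(repr(via_text), repr(via_inner_html))
```

Using partition instead of split(':')[1] also avoids an IndexError when the text contains no colon; in that case the helper returns an empty string.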

You can find a relevant discussion in How to retrieve the text of a WebElement using Selenium - Python
