Python Selenium - get href value

I am trying to copy the href value from a website. The HTML looks like this:

<p class="sc-eYdvao kvdWiq">
 <a href="https://www.iproperty.com.my/property/setia-eco-park/sale-1653165/">Shah Alam Setia Eco Park, Setia Eco Park
 </a>
</p>

I've tried driver.find_elements_by_css_selector(".sc-eYdvao.kvdWiq").get_attribute("href"), but it raised 'list' object has no attribute 'get_attribute'. Using driver.find_element_by_css_selector(".sc-eYdvao.kvdWiq").get_attribute("href") returned None. I can't use XPath because the page has 20+ hrefs and I need to copy all of them; using XPath would only copy one.

If it helps, all 20+ hrefs sit under the same class, sc-eYdvao kvdWiq.

Ultimately I want to copy all 20+ hrefs and export them to a CSV file.

I'd appreciate any help.

You want driver.find_elements if there is more than one element. This returns a list. For the CSS selector, make sure you are selecting elements under those classes that carry an href attribute:

elems = driver.find_elements_by_css_selector(".sc-eYdvao.kvdWiq [href]")
links = [elem.get_attribute('href') for elem in elems]
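The find_elements_by_* helpers used above were removed in Selenium 4.3; the same lookup is now written as driver.find_elements(By.CSS_SELECTOR, ...). Below is a minimal sketch of that form, assuming driver is an already-started Selenium 4 WebDriver. The helper name collect_hrefs is made up for illustration, and the locator is passed as the plain string "css selector" (the value behind By.CSS_SELECTOR) so the snippet needs no import of its own.

```python
def collect_hrefs(driver, selector=".sc-eYdvao.kvdWiq [href]"):
    """Return the href of every element matched by a CSS selector.

    Assumption: `driver` is a Selenium 4 WebDriver. "css selector" is the
    string value of By.CSS_SELECTOR.
    """
    # find_elements always returns a list (empty when nothing matches)
    elements = driver.find_elements("css selector", selector)
    hrefs = []
    for element in elements:
        # get_attribute must be called on each element, not on the list
        href = element.get_attribute("href")
        if href:  # skip elements whose href is missing or empty
            hrefs.append(href)
    return hrefs
```

After driver.get(url), calling links = collect_hrefs(driver) should give the same list as the two lines above.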

You might also need a wait condition for the presence of all elements located by the CSS selector:

elems = WebDriverWait(driver,10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".sc-eYdvao.kvdWiq [href]")))

As per the given HTML:

<p class="sc-eYdvao kvdWiq">
    <a href="https://www.iproperty.com.my/property/setia-eco-park/sale-1653165/">Shah Alam Setia Eco Park, Setia Eco Park</a>
</p>

As the href attribute is on the <a> tag, you need to go one level deeper, to the <a> node. To extract the value of the href attribute you can use either of the following locator strategies:

  • Using css_selector:

     print(driver.find_element_by_css_selector("p.sc-eYdvao.kvdWiq > a").get_attribute('href'))
  • Using xpath:

     print(driver.find_element_by_xpath("//p[@class='sc-eYdvao kvdWiq']/a").get_attribute('href'))

If you want to extract all the values of the href attribute, use find_elements* instead:

  • Using css_selector:

     print([my_elem.get_attribute("href") for my_elem in driver.find_elements_by_css_selector("p.sc-eYdvao.kvdWiq > a")])
  • Using xpath:

     print([my_elem.get_attribute("href") for my_elem in driver.find_elements_by_xpath("//p[@class='sc-eYdvao kvdWiq']/a")])
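To see why get_attribute('href') on the <p> returns None while the <a> works, the same structure can be checked offline. The sketch below uses the standard library's html.parser rather than Selenium, purely as a stand-in for illustration; AnchorHrefParser and extract_hrefs are hypothetical names, not part of any answer here.

```python
from html.parser import HTMLParser


class AnchorHrefParser(HTMLParser):
    """Collect href values of <a> tags found inside <p> tags with given classes."""

    def __init__(self, required_classes):
        super().__init__()
        self.required = set(required_classes)
        self.in_target = False  # True while inside a matching <p>
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "p":
            # The <p> only carries a class attribute; it has no href of its own
            classes = set((attrs.get("class") or "").split())
            self.in_target = self.required <= classes
        elif tag == "a" and self.in_target and attrs.get("href"):
            # The href lives on the child <a>
            self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_target = False


def extract_hrefs(html, required_classes=("sc-eYdvao", "kvdWiq")):
    parser = AnchorHrefParser(required_classes)
    parser.feed(html)
    return parser.hrefs


sample = '''<p class="sc-eYdvao kvdWiq">
    <a href="https://www.iproperty.com.my/property/setia-eco-park/sale-1653165/">Shah Alam Setia Eco Park, Setia Eco Park</a>
</p>'''
print(extract_hrefs(sample))
```

The same function could be fed driver.page_source once the page has rendered, which gives a quick cross-check of what the Selenium locators return.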

Dynamic elements

However, the values of the class attribute, i.e. sc-eYdvao and kvdWiq, look like dynamically generated values. So to extract the href attribute you have to induce WebDriverWait for visibility_of_element_located(), and you can use either of the following locator strategies:

  • Using CSS_SELECTOR:

     print(WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "p.sc-eYdvao.kvdWiq > a"))).get_attribute('href'))
  • Using XPATH:

     print(WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, "//p[@class='sc-eYdvao kvdWiq']/a"))).get_attribute('href'))

If you want to extract all the values of the href attribute, you can use visibility_of_all_elements_located() instead:

  • Using CSS_SELECTOR:

     print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "p.sc-eYdvao.kvdWiq > a")))])
  • Using XPATH:

     print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//p[@class='sc-eYdvao kvdWiq']/a")))])

Note: You have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait     
from selenium.webdriver.common.by import By     
from selenium.webdriver.support import expected_conditions as EC
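If the class names really do change between builds of the site, selecting on them is fragile. One hedged alternative is to collect every anchor and keep only links whose URL looks like a listing. The sketch below assumes listing URLs live under /property/ on www.iproperty.com.my, which is inferred from the single example URL in the question and may not hold for every listing; is_listing_url and keep_listing_urls are illustrative names.

```python
from urllib.parse import urlparse


def is_listing_url(url, host="www.iproperty.com.my", prefix="/property/"):
    """True if the URL points at the assumed listing section of the site."""
    parts = urlparse(url)
    return parts.netloc == host and parts.path.startswith(prefix)


def keep_listing_urls(urls):
    """Filter an iterable of hrefs, dropping None, non-listings and repeats."""
    seen = set()
    kept = []
    for url in urls:
        if url and is_listing_url(url) and url not in seen:
            seen.add(url)
            kept.append(url)  # preserve first-seen order
    return kept
```

With the imports above in place, it could be called as keep_listing_urls(a.get_attribute("href") for a in driver.find_elements(By.TAG_NAME, "a")).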

The XPath

//p[@class='sc-eYdvao kvdWiq']/a

returns the elements you are looking for.

Writing the data to a CSV file is not related to the scraping challenge. Look at a few examples and you will be able to do it.
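For completeness, here is a minimal sketch of the CSV step using only the standard library. It assumes the hrefs have already been collected into a list of strings; the function name write_links_csv, the column name href, and the file name used in the usage note are illustrative choices, not anything from the question.

```python
import csv


def write_links_csv(links, path, header="href"):
    """Write one link per row to a CSV file, preceded by a single header row."""
    # newline="" lets the csv module control line endings
    # (this avoids blank rows between records on Windows)
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow([header])
        writer.writerows([link] for link in links)
```

Usage would be something like write_links_csv(links, "links.csv"), where links is the list built by any of the answers above.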

To crawl any hyperlink or href, the ProxyCrawl API is ideal, as it provides pre-built functions for fetching the desired information. Just pip install the API and follow the code to get the required output. The second approach, fetching href links with Python Selenium, is to run the following code.

Source code:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
import time

# Pages to visit (renamed from `list` to avoid shadowing the built-in)
urls = ['https://www.heliosholland.com/Ampullendoos-voor-63-ampullen', 'https://www.heliosholland.com/lege-testdozen']
driver = webdriver.Chrome()
wait = WebDriverWait(driver, 29)

for url in urls:
    driver.get(url)
    # Wait until the product image is visible, then read its src attribute
    image = wait.until(EC.visibility_of_element_located((By.XPATH, '/html/body/div[1]/div[3]/div[2]/div/div[2]/div/div/form/div[1]/div[1]/div/div/div/div[1]/div/img'))).get_attribute('src')
    print(image)

This example reads an image URL with .get_attribute('src'); to scrape a link from an <a> element, use .get_attribute('href') instead.

Get all the elements you want with driver.find_elements(By.XPATH, 'path'). To extract the href link, call get_attribute('href') on each element, since find_elements returns a list. Which gives:

[elem.get_attribute('href') for elem in driver.find_elements(By.XPATH, 'path')]

Try something like:

elems = driver.find_elements_by_xpath("//p[contains(@class, 'sc-eYdvao') and contains(@class, 'kvdWiq')]/a")
for elem in elems:
    print(elem.get_attribute('href'))
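Note that contains(@class, 'x') is a substring test, so it would also match a class such as x-large. A common XPath 1.0 idiom pads the attribute with spaces and matches whole class tokens. Here is a small sketch that builds such an expression; xpath_with_classes is a hypothetical helper name, and the class names are assumed not to contain quote characters.

```python
def xpath_with_classes(tag, classes, child=""):
    """Build an XPath that matches `tag` elements carrying every class in `classes`.

    Each class is matched as a whole space-separated token, not as a substring.
    `child` is an optional trailing step such as "/a".
    """
    tests = [
        f"contains(concat(' ', normalize-space(@class), ' '), ' {name} ')"
        for name in classes
    ]
    return f"//{tag}[{' and '.join(tests)}]{child}"


print(xpath_with_classes("p", ["sc-eYdvao", "kvdWiq"], "/a"))
```

The resulting string can be passed wherever the answers above pass an XPath, for example to driver.find_elements(By.XPATH, ...).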
