
Fetch all href links using Selenium in Python

I am practicing Selenium in Python and I want to fetch all the links on a web page using Selenium.

For example, I want all the links in the href attribute of every <a> tag on http://psychoticelites.com/.

I've written a script and it runs, but it gives me the object's representation rather than the actual URLs. I've tried using the id attribute to get the value, but it doesn't work.

My current script:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys


driver = webdriver.Firefox()
driver.get("http://psychoticelites.com/")

assert "Psychotic" in driver.title

continue_link = driver.find_element_by_tag_name('a')
elem = driver.find_elements_by_xpath("//*[@href]")
#x = str(continue_link)
#print(continue_link)
print(elem)

Well, you simply have to loop through the list:

elems = driver.find_elements_by_xpath("//a[@href]")
for elem in elems:
    print(elem.get_attribute("href"))

find_elements_by_* returns a list of elements (note the spelling of 'elements'). Loop through the list, take each element, and fetch the attribute value you need from it (in this case, href).

I have checked and tested that there is a function named find_elements_by_tag_name() you can use. This example works fine for me:

elems = driver.find_elements_by_tag_name('a')
for elem in elems:
    href = elem.get_attribute('href')
    if href is not None:
        print(href)

You can try the following approach:

import time

driver.get(URL)  # URL is the page you want to scrape
time.sleep(7)    # give the page time to load
elems = driver.find_elements_by_xpath("//a[@href]")
# alternatively: elems = driver.find_elements_by_partial_link_text('')
for elem in elems:
    print(elem.get_attribute("href"))
driver.close()

Note: adding a delay is very important. First run it in debug mode and make sure your URL's page is actually loading. If the page loads slowly, increase the delay (sleep time) and then extract.
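
A fixed sleep is a blunt instrument; a more robust alternative is an explicit wait, which blocks only until the links are actually present. Here is a minimal sketch, assuming the same driver and URL placeholder as above (WebDriverWait and expected_conditions are standard Selenium helpers):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.get(URL)  # URL is a placeholder for the page to scrape
# wait up to 10 seconds for at least one <a href=...> element to appear
WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.XPATH, "//a[@href]"))
)
for elem in driver.find_elements(By.XPATH, "//a[@href]"):
    print(elem.get_attribute("href"))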

If you still face any issues, please refer to the link below (explained with an example) or comment:

Extract links from webpage using selenium webdriver

You can import the HTML DOM using the htmldom library in Python. You can find it here and install it using pip:

https://pypi.python.org/pypi/htmldom/2.0

from htmldom import htmldom
dom = htmldom.HtmlDom("https://www.github.com/")  
dom = dom.createDom()

The above code creates an HtmlDom object. HtmlDom takes a default parameter, the URL of the page. Once the dom object is created, you need to call the createDom method of HtmlDom. This will parse the HTML data and construct the parse tree, which can then be used for searching and manipulating the HTML data. The only restriction the library imposes is that the data, whether HTML or XML, must have a root element.

You can query elements using the find method of the HtmlDom object:

p_links = dom.find("a")
for link in p_links:
    print("URL: " + link.attr("href"))

The above code will print all the links/URLs present on the web page.

Unfortunately, the original link posted by OP is dead...

If you're looking for a way to scrape links on a page, here's how you can scrape all of the "Hot Network Questions" links on this page with gazpacho:

from gazpacho import Soup

url = "https://stackoverflow.com/q/34759787/3731467"

soup = Soup.get(url)
a_tags = soup.find("div", {"id": "hot-network-questions"}).find("a")

[a.attrs["href"] for a in a_tags]

You can do this with BeautifulSoup in a very easy and efficient way. I have tested the code below and it works fine for the same purpose.

After this line:

driver.get("http://psychoticelites.com/")

use the code below:

import requests
from bs4 import BeautifulSoup

response = requests.get(driver.current_url)  # note: 'driver', not 'browser'
soup = BeautifulSoup(response.content, 'html.parser')
for link in soup.find_all('a'):
    if link.get('href'):
        print(link.get('href'))
        print('\n')  # blank lines between links

import requests
from selenium import webdriver
import bs4

driver = webdriver.Chrome(r'C:\chromedrivers\chromedriver')  # enter the path
# note: the driver is opened, but this snippet actually fetches the page with requests
data = requests.get('https://google.co.in/')  # any website
s = bs4.BeautifulSoup(data.text, 'html.parser')
for link in s.find_all('a'):
    print(link.get('href'))  # print just the href value, not the whole tag

Update to the existing answer: for the current version, it needs to be:

elems = driver.find_elements_by_xpath("//a[@href]")
for elem in elems:
    print(elem.get_attribute("href"))

All of the accepted answers using Selenium's driver.find_elements_by_*** no longer work with Selenium 4. The current method is to use find_elements() with the By class.
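
As a minimal sketch of the change (assuming an existing driver):

from selenium.webdriver.common.by import By

# Selenium 3 style, removed in recent Selenium 4 releases:
# elems = driver.find_elements_by_xpath("//a[@href]")

# Selenium 4 style:
elems = driver.find_elements(By.XPATH, "//a[@href]")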

Method 1: for loop

The code below utilizes two lists: one for By.XPATH and the other for By.TAG_NAME. One can use either; both are not needed.

By.XPATH is, IMO, the easiest, as it does not return a seemingly useless None value like By.TAG_NAME does. The code also removes duplicates.

from selenium.webdriver.common.by import By

driver.get("https://www.amazon.com/")

href_links = []
href_links2 = []

elems = driver.find_elements(by=By.XPATH, value="//a[@href]")
elems2 = driver.find_elements(by=By.TAG_NAME, value="a")

for elem in elems:
    l = elem.get_attribute("href")
    if l not in href_links:
        href_links.append(l)

for elem in elems2:
    l = elem.get_attribute("href")
    if l is not None and l not in href_links2:
        href_links2.append(l)

print(len(href_links))  # 360
print(len(href_links2))  # 360

print(href_links == href_links2)  # True

Method 2: list comprehension

If duplicates are OK, a one-liner list comprehension can be used.

from selenium.webdriver.common.by import By

driver.get("https://www.amazon.com/")

elems = driver.find_elements(by=By.XPATH, value="//a[@href]")
href_links = [e.get_attribute("href") for e in elems]

elems2 = driver.find_elements(by=By.TAG_NAME, value="a")
# href_links2 = [e.get_attribute("href") for e in elems2]  # Does not remove None values
href_links2 = [e.get_attribute("href") for e in elems2 if e.get_attribute("href") is not None]

print(len(href_links))  # 387
print(len(href_links2))  # 387

print(href_links == href_links2)  # True
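
If you want the one-liner but without duplicates, one sketch (reusing the elems list from above) is to deduplicate while preserving order via dict.fromkeys:

from selenium.webdriver.common.by import By

elems = driver.find_elements(By.XPATH, "//a[@href]")
# dict keys are unique and, in Python 3.7+, keep insertion order
href_links = list(dict.fromkeys(e.get_attribute("href") for e in elems))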
