Get all href links using Selenium in Python

Question

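I am practicing Selenium in Python, and I want to use Selenium to fetch all the links on a web page.

For example, I want all the links in the href= attribute of every <a> tag on http://psychoticelites.com/

I wrote a script and it runs. However, it prints object addresses instead of the links. I tried using the id tag to get the value, but that did not work.

My current script: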

from selenium import webdriver
from selenium.webdriver.common.keys import Keys


driver = webdriver.Firefox()
driver.get("http://psychoticelites.com/")

assert "Psychotic" in driver.title

continue_link = driver.find_element_by_tag_name('a')
elem = driver.find_elements_by_xpath("//*[@href]")
#x = str(continue_link)
#print(continue_link)
print(elem)

Answer 1

Well, you simply have to iterate over the list:

elems = driver.find_elements_by_xpath("//a[@href]")
for elem in elems:
    print(elem.get_attribute("href"))

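find_elements_by_* returns a list of elements (note the plural spelling, "elements"). Loop over the list, take each element, and read the attribute value you need from it (href in this case).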
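
The question's script prints object addresses because it prints the element objects themselves rather than one of their attributes. A minimal sketch of the loop above as a reusable helper follows. The names `collect_hrefs` and `FakeElement` are hypothetical and exist only so the snippet runs without a browser; with a real driver you would pass in the list returned by the find call.

```python
def collect_hrefs(elements):
    """Return the href value of every element that has one.

    `elements` is any iterable of objects exposing get_attribute(name),
    such as the element list returned by Selenium.
    """
    hrefs = []
    for element in elements:
        href = element.get_attribute("href")
        if href is not None:  # skip elements that carry no href
            hrefs.append(href)
    return hrefs


class FakeElement:
    """Hypothetical stand-in for a Selenium element, for illustration only."""

    def __init__(self, **attributes):
        self._attributes = attributes

    def get_attribute(self, name):
        return self._attributes.get(name)


sample = [FakeElement(href="http://example.com/a"), FakeElement(id="no-link")]
print(collect_hrefs(sample))
```

Keeping the extraction in one function makes it easy to swap the locator (XPath, tag name) without touching the loop.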

Answer 2

I have checked and tested that you can use a function named find_elements_by_tag_name(). This example works fine for me.

elems = driver.find_elements_by_tag_name('a')
for elem in elems:
    href = elem.get_attribute('href')
    if href is not None:
        print(href)

Answer 3

You can try something like this:

    links = driver.find_elements_by_partial_link_text('')

Answer 4

import time

driver.get(URL)
time.sleep(7)  # give the page time to finish loading
elems = driver.find_elements_by_xpath("//a[@href]")
for elem in elems:
    print(elem.get_attribute("href"))
driver.close()

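Note: adding the delay is very important. Run it in debug mode first and make sure the page at your URL is actually loading. If the page loads slowly, increase the delay (sleep time) before extracting.

If you still face any issues, refer to the link below (explained with an example) or leave a comment:

Extract links from a web page using Selenium WebDriver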
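
A fixed sleep either wastes time or is too short on a slow page. One alternative is to poll until links appear or a timeout expires, sketched below under stated assumptions: the helper name `wait_for_links` is hypothetical, and the driver is assumed to be a Selenium 4 style object whose `find_elements(by, value)` accepts the strategy string "xpath". Selenium also provides `WebDriverWait` for this kind of waiting, and that is likely the better choice in real code; this hand-rolled loop only illustrates the idea.

```python
import time


def wait_for_links(driver, timeout=10.0, poll_interval=0.5):
    """Poll until at least one <a href> element exists or the timeout expires.

    Assumes a Selenium 4 style driver whose find_elements(by, value)
    accepts the locator strategy "xpath".
    """
    deadline = time.monotonic() + timeout
    while True:
        elems = driver.find_elements("xpath", "//a[@href]")
        if elems:
            return elems  # links are present, stop waiting
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no links found within {timeout} seconds")
        time.sleep(poll_interval)
```

The function returns as soon as the first links are found, so a fast page is not penalized by the worst-case delay.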

Answer 5

You can parse the HTML DOM in Python with the htmldom library. You can find it here and install it with pip:

https://pypi.python.org/pypi/htmldom/2.0

from htmldom import htmldom
dom = htmldom.HtmlDom("https://www.github.com/")  
dom = dom.createDom()

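The code above creates an HtmlDom object. HtmlDom takes one default parameter, the URL of the page. Once the dom object is created, you need to call HtmlDom's "createDom" method. This parses the HTML data and constructs the parse tree, which can then be used to search and manipulate the HTML data. The only restriction the library imposes is that the data, whether HTML or XML, must have a root element.

You can query elements using the "find" method of the HtmlDom object: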

p_links = dom.find("a")  
for link in p_links:
  print ("URL: " +link.attr("href"))

The code above prints all the links/URLs present on the web page.
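
If installing a third-party package is not an option, the same idea can be sketched with the standard library's html.parser. This is a swapped-in technique, not part of htmldom; the class name `LinkCollector` and the sample HTML are hypothetical. Like any static parser, it only sees the HTML it is given, so links added later by JavaScript will not appear.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value is not None:
                self.links.append(value)


# Hypothetical sample markup; the second <a> has no href and is skipped.
html = (
    '<html><body>'
    '<a href="/about">About</a>'
    '<a name="top">Top</a>'
    '<a href="https://example.com/">Home</a>'
    '</body></html>'
)
collector = LinkCollector()
collector.feed(html)
print(collector.links)
```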

Answer 6

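Unfortunately, the original link posted by the OP is dead...

If you are looking for a way to scrape links on a page, here is how you can scrape all of the "Hot Network Questions" links on this page with gazpacho: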

from gazpacho import Soup

url = "https://stackoverflow.com/q/34759787/3731467"

soup = Soup.get(url)
a_tags = soup.find("div", {"id": "hot-network-questions"}).find("a")

[a.attrs["href"] for a in a_tags]

Answer 7

You can do this with BeautifulSoup in a very simple and efficient way. I have tested the code below and it works well for this purpose.

After this line -

driver.get("http://psychoticelites.com/")

use the code below -

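import requests
from bs4 import BeautifulSoup

# Re-fetch the page the browser is currently on and parse the response
response = requests.get(driver.current_url)
soup = BeautifulSoup(response.content, 'html.parser')
for link in soup.find_all('a'):
    if link.get('href'):
        print(link.get("href"))
        print('\n')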
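
One difference worth keeping in mind: a parser such as BeautifulSoup returns each href exactly as written in the markup, so relative links like "/about" stay relative, whereas Selenium's get_attribute("href") typically gives the resolved URL. A minimal sketch of resolving parsed hrefs against the page URL with the standard library follows; the helper name `absolutize` and the sample values are hypothetical.

```python
from urllib.parse import urljoin


def absolutize(base_url, hrefs):
    """Resolve each raw href against base_url, skipping empty values."""
    return [urljoin(base_url, href) for href in hrefs if href]


# Hypothetical raw href values as a parser might return them.
raw = ["/about", "contact.html", "https://example.org/x", "#top", None]
for url in absolutize("http://example.com/blog/", raw):
    print(url)
```

In the snippet above, the base URL would be driver.current_url and the hrefs would come from link.get('href').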

Answer 8

import requests
from selenium import webdriver
import bs4
driver = webdriver.Chrome(r'C:\chromedrivers\chromedriver') #enter the path
data=requests.request('get','https://google.co.in/') #any website
s=bs4.BeautifulSoup(data.text,'html.parser')
for link in s.findAll('a'):
    print(link)

Answer 9

An update to the existing answers: with the current version it needs to be:

elems = driver.find_elements_by_xpath("//a[@href]")
for elem in elems:
    print(elem.get_attribute("href"))

Answer 10

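None of the accepted answers that use Selenium's driver.find_elements_by_*** work any longer with Selenium 4. The current approach is to use find_elements() with the By class.

Method 1: For loop

The code below uses 2 lists, one for By.XPATH and the other for By.TAG_NAME. Either one can be used; both are not needed.

By.XPATH is, IMO, the easiest, because it does not return a seemingly useless None value the way By.TAG_NAME does. The code also removes duplicates.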

from selenium.webdriver.common.by import By

driver.get("https://www.amazon.com/")

href_links = []
href_links2 = []

elems = driver.find_elements(by=By.XPATH, value="//a[@href]")
elems2 = driver.find_elements(by=By.TAG_NAME, value="a")

for elem in elems:
    l = elem.get_attribute("href")
    if l not in href_links:
        href_links.append(l)

for elem in elems2:
    l = elem.get_attribute("href")
    if l is not None and l not in href_links2:
        href_links2.append(l)

print(len(href_links))  # 360
print(len(href_links2))  # 360

print(href_links == href_links2)  # True
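
Checking `l not in href_links` scans the whole list for every link, which gets slow on pages with many anchors. A minimal sketch of an order-preserving alternative uses the fact that dict keys are unique and keep insertion order; the helper name `unique_in_order` and the sample URLs are hypothetical.

```python
def unique_in_order(hrefs):
    """Drop None values and duplicates while keeping first-seen order."""
    return list(dict.fromkeys(h for h in hrefs if h is not None))


# Hypothetical values standing in for elem.get_attribute("href") results.
sample = ["https://a.example/", None, "https://b.example/", "https://a.example/"]
print(unique_in_order(sample))
```

With the loops above, this would be applied to the list of get_attribute("href") results instead of appending one by one.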

Method 2: List comprehension

If duplicates are OK, a one-liner list comprehension can be used.

from selenium.webdriver.common.by import By

driver.get("https://www.amazon.com/")

elems = driver.find_elements(by=By.XPATH, value="//a[@href]")
href_links = [e.get_attribute("href") for e in elems]

elems2 = driver.find_elements(by=By.TAG_NAME, value="a")
# href_links2 = [e.get_attribute("href") for e in elems2]  # Does not remove None values
href_links2 = [e.get_attribute("href") for e in elems2 if e.get_attribute("href") is not None]

print(len(href_links))  # 387
print(len(href_links2))  # 387

print(href_links == href_links2)  # True
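
The By.TAG_NAME comprehension above calls get_attribute("href") twice per element, and each call is a separate request to the browser. A minimal sketch of a single-lookup version using an assignment expression (Python 3.8+) follows; `CountingElement` and `hrefs_single_lookup` are hypothetical names, and the fake element only exists to show the lookup count without a browser.

```python
class CountingElement:
    """Hypothetical stand-in for a Selenium element that counts lookups."""

    def __init__(self, href=None):
        self._href = href
        self.lookups = 0

    def get_attribute(self, name):
        self.lookups += 1
        return self._href if name == "href" else None


def hrefs_single_lookup(elements):
    """Collect non-None hrefs, calling get_attribute once per element."""
    return [
        href
        for element in elements
        if (href := element.get_attribute("href")) is not None
    ]


elements = [CountingElement("https://example.com/"), CountingElement()]
print(hrefs_single_lookup(elements))
print([e.lookups for e in elements])
```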

Answer scores and dates (10 answers)

Answer 1: score 97, accepted, 2016-01-13 06:33:29
Answer 2: score 7, 2020-04-29 23:43:22
Answer 3: score 3, 2017-08-31 11:44:17
Answer 4: score 3, 2021-06-12 15:28:56
Answer 5: score 2, 2017-02-21 13:09:46
Answer 6: score 1, 2020-10-10 00:40:45
Answer 7: score 1, 2021-06-26 10:25:44
Answer 8: score 0, 2019-08-01 11:46:03
Answer 9: score 0, 2022-07-05 08:42:21
Answer 10: score 0, 2022-08-09 11:51:52
