Scraping the href title with lxml and XPath

Question

from lxml import html
import requests

for i in range(44, 530):      # Number of pages plus one
    url = "http://postscapes.com/companies/r/{}".format(i)
    page = requests.get(url)
    tree = html.fromstring(page.content)

    # Run the query inside the loop so that every page is inspected, not only the last one
    contactemail = tree.xpath('//*[@id="rt-mainbody"]/div/div/div[2]/div[4]/address/a')

    print(contactemail)

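I am trying to scrape email addresses from 900 different pages of a company directory. The HTML of each page is fairly similar. However, contactemail returns Element objects rather than text. The XPath above points to the a element shown below. I only want to extract the title contact@23-de-enero.com from that link via XPath, but I don't know where to start. I would also like this to work across the different pages, not just for this one href value/web page.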

<a href="mailto:contact@23-de-enero.com">contact@23-de-enero.com</a>

I have looked into regular expressions, and I tried printing with contactemail.textcontent(), but it doesn't work.

Any tips?

Answer 1

There are a few possible ways to extract the same value, i.e. the email address, for example:

# Get the email address from the inner text of the element:
print(contactemail[0].text)

# Get the email address from the href attribute using substring-after():
print(contactemail[0].xpath('substring-after(@href, "mailto:")'))

If there may be more than one a element inside a single address parent element, you can use a list comprehension:

print([link.text for link in contactemail])
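
As a rough sketch of how the approaches above could be combined for many pages, the following selects the href attribute values directly (so XPath returns strings instead of Element objects) and strips the "mailto:" prefix in Python. The sample markup, the helper name extract_emails, and the shortened XPath .//address/a are assumptions made for illustration; the real pages may need the longer path from the question.

```python
from lxml import html

# Hypothetical sample markup, modelled on the snippet in the question.
SAMPLE = """
<div id="rt-mainbody">
  <address>
    <a href="mailto:contact@23-de-enero.com">contact@23-de-enero.com</a>
    <a href="mailto:info@example.com?subject=Hello">Write to us</a>
    <a href="http://example.com/">Website</a>
  </address>
</div>
"""


def extract_emails(tree):
    """Return the addresses of all mailto: links found under an <address> element.

    Hypothetical helper, not part of lxml.
    """
    # Selecting @href makes XPath return attribute strings, not Element objects.
    hrefs = tree.xpath('.//address/a[starts-with(@href, "mailto:")]/@href')
    # Drop the "mailto:" scheme and any "?subject=..." query part.
    return [href[len("mailto:"):].split("?", 1)[0] for href in hrefs]


tree = html.fromstring(SAMPLE)
emails = extract_emails(tree)
print(emails)  # ['contact@23-de-enero.com', 'info@example.com']
```

Reading the href rather than the link text still works when the visible text is not the address itself (as in the second sample link), and a page without any mailto: link simply yields an empty list instead of raising an IndexError on contactemail[0].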
