
Scraping href title using lxml and XPath

from lxml import html
import requests

for i in range(44, 530):      # Number of pages plus one
    url = "http://postscapes.com/companies/r/{}".format(i)
    page = requests.get(url)
    tree = html.fromstring(page.content)

    # Select the contact link inside the address block of each page
    contactemail = tree.xpath('//*[@id="rt-mainbody"]/div/div/div[2]/div[4]/address/a')

    print(contactemail)

I'm trying to scrape emails from 900 different pages on a company directory. The HTML code is relatively similar on every page. However, contactemail returns element objects rather than the email text. The XPath above selects the link shown below. I'd like to extract just the title contact@23-de-enero.com from the href value via XPath, but I don't know quite where to start. I'd also like this to work for different pages, not just this href value / webpage.

<a href="mailto:contact@23-de-enero.com">contact@23-de-enero.com</a>

I've looked into regex and tried printing with contactemail.textcontent(), but it doesn't work.

Any tips?

There are several ways to extract the same value, i.e. the email address. For example:

# get the email address from the inner text of the element:
print(contactemail[0].text)

# get the email address from the href attribute + substring-after():
print(contactemail[0].xpath('substring-after(@href, "mailto:")'))
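As a quick check of the two approaches above, here is a minimal sketch that parses the sample link from the question as an inline string instead of downloading a page. The `snippet` variable and the shortened `//address/a` path are assumptions made for illustration; the real pages nest the link more deeply.

```python
from lxml import html

# Inline stand-in for a downloaded page (assumption: simplified structure).
snippet = (
    '<html><body><address>'
    '<a href="mailto:contact@23-de-enero.com">contact@23-de-enero.com</a>'
    '</address></body></html>'
)
tree = html.fromstring(snippet)

# xpath() returns a list of elements, so index into it before reading values.
contactemail = tree.xpath('//address/a')

# Approach 1: the visible text of the link.
from_text = contactemail[0].text

# Approach 2: the href attribute with the "mailto:" prefix removed by XPath.
from_href = contactemail[0].xpath('substring-after(@href, "mailto:")')

print(from_text)
print(from_href)
```

Both values should be the same string here, because the link text and the href carry the same address.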

You can use a list comprehension if one address parent element may contain multiple a elements:

print([link.text for link in contactemail])
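To make this work across many pages, including pages that have no email at all, one option is to wrap the extraction in a small helper. The sketch below is not part of the original answer: the `extract_emails` function name, the sample page strings, and the looser `//address//a` path are all hypothetical, and it reads the href attribute rather than the link text on the assumption that the text may not always be the address itself.

```python
from lxml import html

def extract_emails(page_source):
    """Return every mailto address found inside <address> elements."""
    tree = html.fromstring(page_source)
    # Ask XPath for the href strings directly, limited to mailto links.
    hrefs = tree.xpath('//address//a[starts-with(@href, "mailto:")]/@href')
    # Drop the "mailto:" prefix from each href.
    return [href[len("mailto:"):] for href in hrefs]

# Hypothetical page with two links; the second link's text is not an email.
page_with_two = (
    '<html><body><address>'
    '<a href="mailto:a@example.com">a@example.com</a> '
    '<a href="mailto:b@example.com">Sales</a>'
    '</address></body></html>'
)

# Hypothetical page with no email link at all.
page_without = '<html><body><address>No email listed</address></body></html>'

emails_two = extract_emails(page_with_two)
emails_none = extract_emails(page_without)

print(emails_two)
print(emails_none)
```

Because the helper returns a list, a page without a matching link yields an empty list instead of raising an IndexError, which is what indexing with `[0]` would do.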


 