Scraping href title using lxml and XPath

Question

from lxml import html
import requests

for i in range(44, 530):      # Number of pages plus one
    url = "http://postscapes.com/companies/r/{}".format(i)
    page = requests.get(url)
    tree = html.fromstring(page.content)

    # Query each page inside the loop; outside it, only the last page is checked
    contactemail = tree.xpath('//*[@id="rt-mainbody"]/div/div/div[2]/div[4]/address/a')

    print(contactemail)

I'm trying to scrape emails from 900 different pages of a company directory. The HTML is similar on every page. However, contactemail returns a list of Element objects rather than the address itself. The XPath above selects the a element shown below. I'd like to extract just the text contact@23-de-enero.com from it via XPath, but I don't quite know where to start. I'd also like this to work across different pages, not just for this href value / webpage.

<a href="mailto:contact@23-de-enero.com">contact@23-de-enero.com</a>

I've looked into regex, and tried printing with contactemail.textcontent() but it doesn't work.

Any tips?

Answer 1

There are several ways to extract the email address, for example:

# get email address from inner text of the element:
print(contactemail[0].text)

# get email address from href attribute + substring-after():
print(contactemail[0].xpath('substring-after(@href, "mailto:")'))
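
To try both approaches without hitting the site, here is a minimal sketch that parses the anchor from the question out of an inline string. The `sample` document and the shortened XPath `//address/a` are assumptions made for illustration; the real pages have more markup around the address element.

```python
from lxml import html

# Inline stand-in for one directory page (assumed structure, not the real page).
sample = (
    '<html><body><div id="rt-mainbody"><address>'
    '<a href="mailto:contact@23-de-enero.com">contact@23-de-enero.com</a>'
    '</address></div></body></html>'
)

tree = html.fromstring(sample)
contactemail = tree.xpath('//address/a')   # a list of Element objects

# Approach 1: inner text of the first matching element
from_text = contactemail[0].text

# Approach 2: href attribute with the "mailto:" prefix stripped by XPath
from_href = contactemail[0].xpath('substring-after(@href, "mailto:")')

print(from_text)
print(from_href)
```

Both variables should hold the same address here. The href variant is probably the safer choice when the link text is something other than the address itself.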

If one address parent element may contain several a elements, you can use a list comprehension:

print([link.text for link in contactemail])
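
Since the question asks for something that works across pages, one possible sketch is a small helper that looks for any mailto link inside an address element instead of relying on the long positional XPath. The name `extract_emails` and the two inline test pages are hypothetical, and the relaxed XPath assumes every page keeps its email link somewhere under an address element.

```python
from lxml import html

def extract_emails(page_source):
    """Return the addresses of all mailto links found under <address> elements."""
    tree = html.fromstring(page_source)
    # Select only the href values that start with "mailto:" (assumed page layout).
    hrefs = tree.xpath('//address//a[starts-with(@href, "mailto:")]/@href')
    # Strip the "mailto:" prefix in Python.
    return [href[len("mailto:"):] for href in hrefs]

# Hypothetical pages: one with an email link, one without.
page_with_email = (
    '<html><body><address>'
    '<a href="mailto:contact@23-de-enero.com">contact@23-de-enero.com</a>'
    '</address></body></html>'
)
page_without_email = (
    '<html><body><address>'
    '<a href="http://example.com">Website</a>'
    '</address></body></html>'
)

print(extract_emails(page_with_email))
print(extract_emails(page_without_email))
```

Inside the loop from the question, this could be called as extract_emails(page.content). A page with no email link yields an empty list, so indexing with [0] is never needed.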

1 answer

solution1
0 ACCEPTED 2016-03-09 03:11:17