
Extracting href URL with Python Requests

I would like to extract the URL from an XPath query, using the requests package in Python. I can get the text, but nothing I try gives the URL. Can anyone help?

ipdb> webpage.xpath(xpath_url + '/text()')
['Text of the URL']
ipdb> webpage.xpath(xpath_url + '/a()')
*** lxml.etree.XPathEvalError: Invalid expression
ipdb> webpage.xpath(xpath_url + '/href()')
*** lxml.etree.XPathEvalError: Invalid expression
ipdb> webpage.xpath(xpath_url + '/url()')
*** lxml.etree.XPathEvalError: Invalid expression

I used this tutorial to get started: http://docs.python-guide.org/en/latest/scenarios/scrape/

It seems like it should be easy, but nothing comes up in my searches.

Thank you.

Have you tried webpage.xpath(xpath_url + '/@href')?

Here is the full code:

from lxml import html
import requests

# Fetch the page and parse it into an lxml HTML element tree
page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
webpage = html.fromstring(page.content)

# Select the href attribute of every <a> element on the page
webpage.xpath('//a/@href')

The result should be:

[
  'http://econpy.pythonanywhere.com/ex/002.html',
  'http://econpy.pythonanywhere.com/ex/003.html', 
  'http://econpy.pythonanywhere.com/ex/004.html',
  'http://econpy.pythonanywhere.com/ex/005.html'
]
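
If you also want each link's text alongside its URL, a small sketch building on the webpage object defined above:

for a in webpage.xpath('//a'):
    print(a.text_content(), a.get('href'))  # link text and its href attribute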

You would be better served using BeautifulSoup:

from bs4 import BeautifulSoup
import requests

html = requests.get('http://testurl.com')
soup = BeautifulSoup(html.text, "lxml")  # lxml is just the parser used to read the HTML
soup.find_all('a', href=True)            # every <a> tag that has an href attribute

You can print that result, add it to a list, etc. To iterate through it and pull out each URL, use:

links = soup.find_all('a', href=True)
for link in links:
    print(link['href'])
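
Applied to the example page from the first answer (an assumption, since this answer only uses a placeholder URL), the same approach reproduces the list of links:

import requests
from bs4 import BeautifulSoup

page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
soup = BeautifulSoup(page.text, 'lxml')
print([a['href'] for a in soup.find_all('a', href=True)])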

Or, using requests_html, with the benefits of a context manager:

import requests_html

with requests_html.HTMLSession() as s:
    try:
        r = s.get('http://econpy.pythonanywhere.com/ex/001.html')
        links = r.html.links
        for link in links:
            print(link)
    except Exception:
        pass

You can do it easily with Selenium:

# here 'webpage' is a Selenium WebDriver instance, not the lxml tree from the earlier answers
link = webpage.find_element_by_xpath('xpath to the element with the link')  # placeholder XPath
url = link.get_attribute('href')
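
A more complete sketch, assuming a local Chrome setup and the example page from the question, written in the newer Selenium 4 find_element(By.XPATH, ...) style:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes Chrome and a driver Selenium can locate
driver.get('http://econpy.pythonanywhere.com/ex/001.html')

# '//a' is a placeholder; substitute the XPath to your specific element
link = driver.find_element(By.XPATH, '//a')
print(link.get_attribute('href'))

driver.quit()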

Requests-HTML

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.***.com')
r.html.links
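
Note that r.html.links returns the hrefs exactly as they appear in the page (so they can be relative), while r.html.absolute_links resolves them against the page URL. A quick sketch using the example page from the question:

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('http://econpy.pythonanywhere.com/ex/001.html')
print(r.html.links)           # set of hrefs as written in the page
print(r.html.absolute_links)  # the same links resolved to absolute URLs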
