
How to extract links from a webpage using lxml, XPath and Python?

I've got this XPath query:

/html/body//tbody/tr[*]/td[*]/a[@title]/@href

It extracts all the links that have a title attribute and returns their href values in Firefox's XPath Checker add-on.

However, I cannot get it to work with lxml.

from lxml import etree
parsedPage = etree.HTML(page) # Create parse tree from valid page.

# Xpath query
hyperlinks = parsedPage.xpath("/html/body//tbody/tr[*]/td[*]/a[@title]/@href")
for x in hyperlinks:
    print(x)  # Print the href of each <a> tag that has a title attribute

This produces no result from lxml (empty list).

How would one grab the href text (link) of a hyperlink containing the attribute title with lxml under Python?

I was able to make it work with the following code:

from io import StringIO  # Python 3; on Python 2 this was "from StringIO import StringIO"

from lxml import etree

html_string = '''<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
   "http://www.w3.org/TR/html4/loose.dtd">

<html lang="en">
<head/>
<body>
    <table border="1">
      <tbody>
        <tr>
          <td><a href="http://stackoverflow.com/foobar" title="Foobar">A link</a></td>
        </tr>
        <tr>
          <td><a href="http://stackoverflow.com/baz" title="Baz">Another link</a></td>
        </tr>
      </tbody>
    </table>
</body>
</html>'''

# etree.parse() uses the XML parser by default; that works here because
# this sample is well-formed and really does contain a <tbody> element.
tree = etree.parse(StringIO(html_string))
print(tree.xpath('/html/body//tbody/tr/td/a[@title]/@href'))

>>> ['http://stackoverflow.com/foobar', 'http://stackoverflow.com/baz']

Firefox adds extra HTML elements (such as <tbody>) to the DOM when it renders a page, so an XPath generated by Firebug may not match the actual HTML returned by the server (which is what urllib/urllib2 will give you).

Removing the tbody step from the XPath query generally does the trick.
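
To illustrate the point, here is a minimal sketch, assuming the server sends a table without a <tbody> element. The page content and the example.com URLs are made up for the demonstration. It runs the browser-derived query and the same query without the tbody step against the tree that lxml builds:

```python
from lxml import etree

# Hypothetical server response: the table has no <tbody> in the raw source.
page = """
<html><body>
  <table>
    <tr><td><a href="http://example.com/a" title="A">With title</a></td></tr>
    <tr><td><a href="http://example.com/b">No title</a></td></tr>
  </table>
</body></html>
"""

tree = etree.HTML(page)

# The browser-derived query requires a <tbody> step that is absent from the source.
with_tbody = tree.xpath("/html/body//tbody/tr/td/a[@title]/@href")

# Dropping the <tbody> step matches the markup the server actually sent.
without_tbody = tree.xpath("/html/body//tr/td/a[@title]/@href")

print(with_tbody)
print(without_tbody)
```

If the page source does contain a <tbody>, as in the sample above, the original query works unchanged. A query such as //tr/td/a[@title]/@href covers both cases, because // matches descendants at any depth.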
