
How to extract links from a webpage using lxml, XPath and Python?

I've got this XPath query:

/html/body//tbody/tr[*]/td[*]/a[@title]/@href

It extracts all the links that have a title attribute and returns their href values in Firefox's XPath Checker add-on.

However, I cannot seem to use it with lxml.

from lxml import etree
parsedPage = etree.HTML(page)  # Create a parse tree from the page source held in `page`.

# XPath query
hyperlinks = parsedPage.xpath("/html/body//tbody/tr[*]/td[*]/a[@title]/@href")
for x in hyperlinks:
    print(x)  # Print the href of each <a> tag that has a title attribute

With lxml, this produces no results (an empty list).

How would one grab the href text (link) of a hyperlink that has a title attribute, using lxml under Python?

I was able to make it work with the following code:

from io import StringIO

from lxml import etree

html_string = '''<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
   "http://www.w3.org/TR/html4/loose.dtd">

<html lang="en">
<head/>
<body>
    <table border="1">
      <tbody>
        <tr>
          <td><a href="http://stackoverflow.com/foobar" title="Foobar">A link</a></td>
        </tr>
        <tr>
          <td><a href="http://stackoverflow.com/baz" title="Baz">Another link</a></td>
        </tr>
      </tbody>
    </table>
</body>
</html>'''

# Parse the markup from a file-like object and collect the href of every titled link.
tree = etree.parse(StringIO(html_string))
print(tree.xpath('/html/body//tbody/tr/td/a[@title]/@href'))

>>> ['http://stackoverflow.com/foobar', 'http://stackoverflow.com/baz']

When Firefox renders a page, it adds extra HTML tags to it, so the XPath reported by the Firebug tool is inconsistent with the actual HTML returned by the server (which is also what urllib/urllib2 will return).

Removing the <tbody> step from the XPath generally does the trick.
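As a minimal sketch of that fix, the snippet below runs three variants of the query against a table that has no <tbody> in its source. The markup and the example.com URLs are made up for illustration and are not from the original page. The third query uses `//tr` so that it matches whether or not a <tbody> is present, which is one possible way to stay robust rather than the only one.

```python
from lxml import etree

# Hypothetical server response: the table has no <tbody>, as raw HTML often does not.
page = """
<html><body>
  <table>
    <tr><td><a href="http://example.com/foo" title="Foo">A link</a></td></tr>
    <tr><td><a href="http://example.com/bar">No title</a></td></tr>
  </table>
</body></html>
"""

tree = etree.HTML(page)

# The query copied from the browser expects a <tbody> that is not in the source.
with_tbody = tree.xpath("/html/body//tbody/tr/td/a[@title]/@href")

# Dropping the <tbody> step matches the markup as the server sent it.
without_tbody = tree.xpath("/html/body//table/tr/td/a[@title]/@href")

# Using //tr tolerates both cases: rows directly under <table> or inside <tbody>.
either = tree.xpath("/html/body//table//tr/td/a[@title]/@href")

print(with_tbody)     # []
print(without_tbody)  # ['http://example.com/foo']
print(either)         # ['http://example.com/foo']
```

Only the link with a title attribute is returned; the second link is skipped because of the `[@title]` predicate.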
