Parsing HTML with LXML in Python

Question

I am trying to parse a website for

blahblahblah 
<a  href="THIS IS WHAT I WANT" title="NOT THIS">I DONT CARE ABOUT THIS EITHER</a>
blahblahblah

(there are many of these, and I want all of them in some tokenized form). Unfortunately the HTML is very large and a little complicated, so crawling down the tree might take me some time just to sort out the nested elements. Is there an easy way to just retrieve this?

Thanks!

Answer 1

If you just want the hrefs of the a tags, then use:

data = """blahblahblah 
<a  href="THIS IS WHAT I WANT" title="NOT THIS">I DONT CARE ABOUT THIS EITHER</a>
blahblahblah"""

import lxml.html

# Parse the HTML fragment, then select the href attribute of every <a> element.
tree = lxml.html.fromstring(data)
print(tree.xpath('//a/@href'))

# ['THIS IS WHAT I WANT']
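
Since the question mentions many such links spread across a large page, here is a minimal sketch of the same XPath idea applied to several links at once. The sample markup and the helper name extract_links are hypothetical, made up for illustration; only lxml.html.fromstring, xpath, get and text_content come from lxml itself.

```python
import lxml.html

# Hypothetical sample page with several links; not taken from the original site.
SAMPLE = """
<div>
  <p>intro <a href="/first" title="one">First</a></p>
  <ul>
    <li><a href="/second">Second</a></li>
    <li><a name="anchor-only">No href here</a></li>
  </ul>
</div>
"""

def extract_links(html):
    """Return (href, link text) pairs for every <a> element that has an href."""
    tree = lxml.html.fromstring(html)
    # The [@href] predicate skips <a> elements that lack an href attribute.
    return [
        (a.get("href"), a.text_content().strip())
        for a in tree.xpath("//a[@href]")
    ]

links = extract_links(SAMPLE)
for href, text in links:
    print(href, text)
```

Selecting the elements rather than the attribute values keeps the link text available next to each href, and the // step finds the links at any nesting depth, so there is no need to walk the tree by hand.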

Answer 1: score 14, accepted, 2013-02-02 15:59:17