
Parse HTML using LXML in Python

I am trying to parse a website for elements like this:

blahblahblah 
<a  href="THIS IS WHAT I WANT" title="NOT THIS">I DONT CARE ABOUT THIS EITHER</a>
blahblahblah 

(There are many of these, and I want all of them in some tokenized form.) Unfortunately, the HTML is very large and a little complicated, so crawling down the tree by hand would take me a while just to sort out the nested elements. Is there an easy way to retrieve just this?

Thanks!

If you just want the href values of the a tags, use:

data = """blahblahblah 
<a  href="THIS IS WHAT I WANT" title="NOT THIS">I DONT CARE ABOUT THIS EITHER</a>
blahblahblah"""

import lxml.html
tree = lxml.html.fromstring(data)
print(tree.xpath('//a/@href'))

# ['THIS IS WHAT I WANT']
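
Since the question mentions many such links buried in nested markup, here is a minimal sketch of the same XPath approach on a slightly larger input. The sample HTML and the variable names (html, hrefs, pairs) are made up for illustration and are not from the original page; it assumes lxml is installed.

```python
import lxml.html

# Hypothetical sample input: several links at different nesting depths,
# one of which has no href attribute at all.
html = """
<div>
  <p>intro <a href="/first" title="t1">one</a></p>
  <ul>
    <li><a href="/second">two</a></li>
    <li><a name="no-href">three</a></li>
    <li><span><a href="/third" title="t3">four</a></span></li>
  </ul>
</div>
"""

tree = lxml.html.fromstring(html)

# '//a/@href' selects the href attribute of every <a> at any depth,
# so the nested elements never have to be walked by hand.
# Anchors without an href simply contribute nothing.
hrefs = tree.xpath('//a/@href')
print(hrefs)

# If the link text is wanted alongside each href, select the elements
# instead of the attributes and read both from each element.
pairs = [(a.get('href'), a.text_content()) for a in tree.xpath('//a[@href]')]
print(pairs)
```

The result of the attribute query is an ordinary list of strings in document order, which is probably the simplest "tokenized form" to work with.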

