Parse HTML using LXML in Python

Question

I am trying to parse a website for links like this:

blahblahblah 
<a  href="THIS IS WHAT I WANT" title="NOT THIS">I DONT CARE ABOUT THIS EITHER</a>
blahblahblah

(There are many of these, and I want all of them in some tokenized form.) Unfortunately, the HTML is very large and somewhat complicated, so walking down the tree by hand would take me a while just to sort out the nested elements. Is there an easy way to retrieve just these href values?

Thanks!

Answer 1

If you just want the href values of the a tags, use:

data = """blahblahblah 
<a  href="THIS IS WHAT I WANT" title="NOT THIS">I DONT CARE ABOUT THIS EITHER</a>
blahblahblah"""

import lxml.html
tree = lxml.html.fromstring(data)
print(tree.xpath('//a/@href'))

# ['THIS IS WHAT I WANT']
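
Since the question mentions many links buried in nested markup, here is a minimal sketch of the same XPath applied to a slightly larger document. The sample HTML and its paths (/first, /second) are made up for illustration and are not from the original page; only lxml.html.fromstring and xpath from the answer above are used.

```python
import lxml.html

# Hypothetical sample with links at different nesting depths
# (invented for illustration, not taken from the asker's page).
sample = """
<div>
  <p>blah <a href="/first" title="NOT THIS">text</a></p>
  <ul>
    <li><a href="/second">more text</a></li>
    <li><a name="anchor-without-href">no link here</a></li>
  </ul>
</div>
"""

tree = lxml.html.fromstring(sample)

# '//a/@href' selects the href attribute of every <a> at any depth,
# so there is no need to walk the nested elements by hand.
# <a> elements that have no href contribute nothing to the result.
hrefs = tree.xpath('//a/@href')
print(hrefs)

# ['/first', '/second']
```

The result is a plain list of strings in document order, which is probably close to the "tokenized form" the question asks for.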

1 answer

solution1
14 ACCEPTED 2013-02-02 15:59:17
