
Namespace argument in lxml parsing

I have an HTML page that I am trying to parse. Here is what I'm doing with lxml:

>>> node = etree.fromstring(html)
>>> node
<Element {http://www.w3.org/1999/xhtml}html at 0x110676a70>
>>> node.xpath('//body')
[]
>>> node.xpath('body')
[]

Unfortunately, all my xpath calls are now returning an empty list. Why is this happening, and how can I fix these calls?

The document's elements all live in the XHTML namespace, so an unprefixed path like //body matches nothing. You can bind a prefix to that namespace and pass the mapping to xpath(), as follows:

>>> node.xpath('//xmlns:tr', namespaces={'xmlns':'http://www.w3.org/1999/xhtml'})
[<Element {http://www.w3.org/1999/xhtml}tr at 0x11067b6c8>, <Element {http://www.w3.org/1999/xhtml}tr at 0x11067b710>]
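A minimal, self-contained sketch of the same fix (the document snippet and the `x` prefix here are illustrative, not from the original question):

```python
from lxml import etree

# A minimal XHTML document whose root declares the XHTML default
# namespace, so every element in it is namespaced.
html = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<body><p>hello</p></body></html>'
)
node = etree.fromstring(html)

# Without a prefix, the XPath looks for elements in *no* namespace -> empty.
assert node.xpath('//body') == []

# Bind any prefix to the XHTML namespace URI and use it in the query.
ns = {'x': 'http://www.w3.org/1999/xhtml'}
body = node.xpath('//x:body', namespaces=ns)
print(body[0].tag)  # prints the fully qualified tag name
```

The prefix name itself is arbitrary; it only has to match between the query string and the `namespaces` dict.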

A better way is to use lxml's HTML parser, which treats the xmlns declaration as a plain attribute and leaves the tags un-namespaced:

>>> node = lxml.html.fromstring(html)
>>> node.findall('body')
[<Element body at 0x1106b8f18>]
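A runnable sketch of that approach, using the same illustrative XHTML snippet as above:

```python
import lxml.html

# The same namespaced XHTML document as before (illustrative sample).
html = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<body><p>hello</p></body></html>'
)
node = lxml.html.fromstring(html)

# The HTML parser ignores the XHTML namespace declaration, so plain
# tag names work in findall() and xpath() alike -- no prefixes needed.
print(node.findall('body')[0].tag)  # prints: body
assert len(node.xpath('//body')) == 1
```

This is usually the most convenient option when the input really is a web page rather than an arbitrary XML document.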

You need to use a namespace prefix when querying, like:

node.xpath('//html:body', namespaces={'html': 'http://...'})

Alternatively, you can build the mapping from the element's .nsmap. Beware of one pitfall: a default namespace (declared with xmlns="...") appears in .nsmap under the key None, and xpath() rejects a None prefix. So remap it to a real prefix first:

nsmap = {prefix or 'html': uri for prefix, uri in node.nsmap.items()}
node.xpath('//html:body', namespaces=nsmap)

This assumes all the namespaces used in the query are declared on the element node points to, which is usually true for the root element of most XML documents.
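Since the XHTML namespace on this kind of page is a default namespace, `node.nsmap` will contain the key `None`, which `xpath()` refuses. A runnable sketch of the remapping (the `html` prefix is an arbitrary choice):

```python
from lxml import etree

# Illustrative XHTML snippet with a default namespace declaration.
doc = etree.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml"><body/></html>'
)

# The default namespace shows up under the key None in nsmap...
assert None in doc.nsmap

# ...so remap None to a concrete prefix before passing it to xpath().
ns = {prefix or 'html': uri for prefix, uri in doc.nsmap.items()}
result = doc.xpath('//html:body', namespaces=ns)
print(result[0].tag)  # prints: {http://www.w3.org/1999/xhtml}body
```

Passing `doc.nsmap` directly would raise an error here because of the `None` key.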
