lxml XPath - how to get concatenated text from node

Question

I have a node like

<a class="someclass">
Wie
<em>Messi</em>
einen kleinen Jungen stehen lässt
</a>

How do I construct an XPath to get ["Wie Messi einen kleinen Jungen stehen lässt"] instead of ["Wie","Messi","einen kleinen Jungen stehen lässt"]?

I am using Python's lxml.html module with XPath.

Tried combinations

//a/node()/text()
//a/descendant::*/text()
//a/text()

But it didn't help. Any solutions?

I was thinking of another approach where I somehow get the "inner html" of the <a> element (which in the above case will be "Wie <em>Messi</em> einen kleinen Jungen stehen lässt" ) and remove the <em> tags from the html.

Still trying to figure out how to get innerhtml (Javascript, anyone?) from XPath.

Answer 1

XPath is a selection language, so what it can do is select nodes. If there are separate nodes in the input then you will get a list of separate nodes as the selection result.

You'll need the help of your host language - Python in this case - to do things beyond that scope (like merging text nodes into a single string).

You need to find all <a> elements and join their individual text descendants. That's easy enough to do:

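from lxml import etree

# etree.parse() expects well-formed XML; for real-world HTML use lxml.html.parse() instead
doc = etree.parse("path/to/file")

for a in doc.xpath("//a"):
    # itertext() yields every text node below <a> in document order;
    # whitespace-only nodes are skipped so the join does not produce double spaces
    print(" ".join(t.strip() for t in a.itertext() if t.strip()))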

prints

Wie Messi einen kleinen Jungen stehen lässt

As paul correctly points out in the comments below, you can use XPath's normalize-space() and the whole thing gets even simpler.

for a in doc.xpath("//a"):
    # normalize-space() with no argument uses the string value of the context node
    print(a.xpath("normalize-space()"))
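
The loops above read from a file. As a self-contained sketch of the same idea, the following parses the markup from the question out of a string instead. SNIPPET and link_texts are hypothetical names introduced only for this illustration, and lxml.html.fromstring is assumed to be an acceptable stand-in for parsing the real page.

```python
from lxml import html

# Hypothetical inline copy of the markup from the question.
SNIPPET = """<a class="someclass">
Wie
<em>Messi</em>
einen kleinen Jungen stehen lässt
</a>"""


def link_texts(markup):
    """Return the whitespace-normalized text of every <a> in the markup."""
    root = html.fromstring(markup)
    # Evaluated on an element, normalize-space() without an argument works on
    # that element's string value, i.e. all descendant text concatenated.
    return [a.xpath("normalize-space()") for a in root.xpath("//a")]


print(link_texts(SNIPPET))
```

For the sample markup this should give a one-element list containing the whole sentence, because normalize-space() trims the ends and collapses each run of whitespace (including the newlines around <em>) into a single space.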

Answer 2

If you get the string value of the <a> node instead of using text(), you will get the concatenation of all its descendant text nodes, instead of individual text nodes.

Try using simply

//a

and read the node as a string in your host language. In Python you can use a DOM function as mentioned by @Tomalak to obtain the string value. In lxml you can use .text_content() on an element; note that xpath() returns a list, so pick an element first:

tree.xpath("//a")[0].text_content()

Within XPath, you can use the string() conversion function, which returns the string value of the first matching node:

string(//a)
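
A minimal sketch comparing the two routes from this answer, assuming the markup from the question is available as a string. SNIPPET and both function names are hypothetical and exist only for this illustration. Neither .text_content() nor string() collapses whitespace, so the newlines from the source are squeezed in Python here.

```python
from lxml import html

# Hypothetical inline copy of the markup from the question.
SNIPPET = """<a class="someclass">
Wie
<em>Messi</em>
einen kleinen Jungen stehen lässt
</a>"""


def first_link_text(markup):
    """String value of the first <a>, read via .text_content()."""
    root = html.fromstring(markup)
    links = root.xpath("//a")  # xpath() returns a list of elements
    if not links:
        return None
    # split() + join() collapses newlines and repeated spaces.
    return " ".join(links[0].text_content().split())


def first_link_text_xpath(markup):
    """Same value, but converted to a string inside the XPath expression."""
    root = html.fromstring(markup)
    # string(//a) is the string value of the first <a> in document order.
    return " ".join(root.xpath("string(//a)").split())


print(first_link_text(SNIPPET))
print(first_link_text_xpath(SNIPPET))
```

Both calls should produce the same sentence for the sample. The practical difference is that string(//a) only ever covers the first match, while the list returned by xpath("//a") lets you call .text_content() on each link.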
