lxml.etree, element.text doesn't return the entire text from an element

Question

I scraped some HTML via XPath, which I then converted into an etree. Something similar to this:

<td> text1 <a> link </a> text2 </td>

but when I call element.text, I only get text1. (The rest must be there: when I check my query in Firebug, the text of the element is highlighted, both the text before and after the embedded anchor elements...)

Answer 1

Use element.xpath("string()") or lxml.etree.tostring(element, method="text") - see the documentation.
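A small sketch of the two calls side by side, assuming Python 3 and a reasonably recent lxml. The surrounding <tr> and the word "after" are made up here to expose one difference: tostring() also serializes the element's tail unless with_tail=False is passed, while string() covers only what is inside the element.

```python
from lxml import etree

# Hypothetical wrapper row so that the <td> has a tail (" after ").
row = etree.fromstring("<tr><td> text1 <a> link </a> text2 </td> after </tr>")
td = row[0]

# XPath string value: all text inside <td>, without its tail.
via_xpath = td.xpath("string()")

# Text serialization: includes the tail by default.
via_tostring = etree.tostring(td, method="text", encoding="unicode")

# Same serialization with the tail switched off.
no_tail = etree.tostring(td, method="text", encoding="unicode", with_tail=False)

print(repr(via_xpath))
print(repr(via_tostring))
print(repr(no_tail))
```

Passing encoding="unicode" makes tostring() return a str rather than bytes.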

Answer 2

As a public service to people out there who may be as lazy as I am, here's some code from the other answers that you can run.

from lxml import etree

def get_text1(node):
    # Own text only: node.text plus the tail of each direct child.
    result = node.text or ""
    for child in node:
        if child.tail is not None:
            result += child.tail
    return result

def get_text2(node):
    # All text, recursively; this also appends the node's own tail.
    return ((node.text or '') +
            ''.join(map(get_text2, node)) +
            (node.tail or ''))

def get_text3(node):
    # Inner markup: children are serialized as XML, tails included.
    # encoding="unicode" returns str instead of bytes on Python 3.
    return (node.text or "") + "".join(
        [etree.tostring(child, encoding="unicode") for child in node.iterchildren()])


root = etree.fromstring("<td> text1 <a> link </a> text2 </td>")

print(root.xpath("text()"))
print(get_text1(root))
print(get_text2(root))
print(root.xpath("string()"))
print(etree.tostring(root, method="text", encoding="unicode"))
print(etree.tostring(root, method="xml", encoding="unicode"))
print(get_text3(root))

Output is:

snowy:rpg$ python test.py 
[' text1 ', ' text2 ']
 text1  text2 
 text1  link  text2 
 text1  link  text2 
 text1  link  text2 
<td> text1 <a> link </a> text2 </td>
 text1 <a> link </a> text2

Answer 3

This looks like an lxml bug to me, but it is by design according to the documentation. I've solved it like this:

def node_text(node):
    if node.text:
        result = node.text
    else:
        result = ''
    for child in node:
        if child.tail is not None:
            result += child.tail
    return result

Answer 4

Another thing that seems to work well for getting the text out of an element is "".join(element.itertext())
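A minimal sketch of that call on the markup from the question, assuming Python 3. itertext() yields the text and tail strings in document order, so joining them gives the full text of the element.

```python
from lxml import etree

td = etree.fromstring("<td> text1 <a> link </a> text2 </td>")

# Each piece is one text or tail string, in document order.
pieces = list(td.itertext())

# Joining the pieces gives the complete text content of <td>.
full_text = "".join(pieces)

print(pieces)
print(repr(full_text))
```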

Answer 5

<td> text1 <a> link </a> text2 </td>

Here's how it is (ignoring whitespace):

td.text == 'text1'
a.text == 'link'
a.tail == 'text2'

If you don't want the text that is inside child elements, you can collect only their tails (text and tail are None when empty, hence the "or ''"):

text = (td.text or '') + ''.join([el.tail or '' for el in td])
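To make the text/tail split concrete, here is a sketch with the whitespace kept. own_text is a hypothetical helper name, not part of lxml; it treats a missing text or tail (None) as an empty string, which matters for elements such as <td><a>link</a></td>.

```python
from lxml import etree

td = etree.fromstring("<td> text1 <a> link </a> text2 </td>")
a = td[0]

def own_text(node):
    """Text directly inside node, skipping text inside child elements."""
    return (node.text or "") + "".join(child.tail or "" for child in node)

own = own_text(td)

# An element with no text of its own: text and tail are both None here.
empty = own_text(etree.fromstring("<td><a>link</a></td>"))

print(repr(td.text), repr(a.text), repr(a.tail))
print(repr(own), repr(empty))
```

The text that follows </a> belongs to the <a> element as its tail, not to <td>, which is why td.text alone stops at the first child.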

Answer 6

def get_text_recursive(node):
    return (node.text or '') + ''.join(map(get_text_recursive, node)) + (node.tail or '')

Answer 7

If element is the <td>, you can do the following:

element.xpath('.//text()')

It gives you a list of all text nodes under the element itself (that is the meaning of the dot). The // selects descendants at any depth, and text() is the node test that matches text nodes.
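A short sketch of what that query returns, assuming Python 3 and lxml's default "smart strings": each result is a str subclass that also knows which element it came from (getparent()) and whether it was that element's tail (is_tail).

```python
from lxml import etree

td = etree.fromstring("<td> text1 <a> link </a> text2 </td>")

# All text nodes below <td>, in document order.
texts = td.xpath(".//text()")

# For each string: its value, the tag of the owning element, and
# whether it is that element's tail rather than its text.
origins = [(str(t), t.getparent().tag, t.is_tail) for t in texts]

print(texts)
print(origins)
```

Unlike string(), this keeps the pieces separate, so you can filter them before joining.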

Answer 8

element.xpath('normalize-space()') also works, and it additionally trims and collapses whitespace.
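For comparison, a sketch of normalize-space() next to string() on the question's markup, assuming Python 3. The XPath function strips leading and trailing whitespace and collapses internal runs of whitespace to single spaces.

```python
from lxml import etree

td = etree.fromstring("<td> text1 <a> link </a> text2 </td>")

raw = td.xpath("string()")                 # whitespace preserved
normalized = td.xpath("normalize-space()") # whitespace trimmed and collapsed

print(repr(raw))
print(repr(normalized))
```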

8 answers

solution1
16 2011-01-23 01:56:33

solution2
6 2013-10-06 13:19:49

solution3
5 2011-09-21 13:09:35

solution4
4 2014-04-06 08:04:48

solution5
2 2013-12-08 00:49:46

solution6
1 2012-01-26 03:26:46

solution7
0 2017-05-23 18:51:37

solution8
0 2017-07-24 03:59:14
