How to get the raw text from an lxml element using Python

Question

I want to get the following inline text strings from the root element.

from lxml import etree

root = etree.fromstring(
'''<p>
    text-first
    <span>
        Child 1
    </span>
    text-middle
    <span>
        Child 2
    </span>
    text-last
</p>''')

root.text only returns "text-first", including the surrounding newlines.

>>> build_text_list = etree.XPath("//text()")

>>> texts = build_text_list(root)
>>>
>>> texts
['\n    text-first\n    ', '\n        Child 1\n    ', '\n    text-middle\n    ', '\n        Child 2\n    ', '\n    text-last\n']
>>>
>>> for t in texts:
...     print t
...     print t.__dict__
...

    text-first

{'_parent': <Element p at 0x10140f638>, 'is_attribute': False, 'attrname': None, 'is_text': True, 'is_tail': False}

        Child 1

{'_parent': <Element span at 0x10140be18>, 'is_attribute': False, 'attrname': None, 'is_text': True, 'is_tail': False}

    text-middle

{'_parent': <Element span at 0x10140be18>, 'is_attribute': False, 'attrname': None, 'is_text': False, 'is_tail': True}

        Child 2

{'_parent': <Element span at 0x10140be60>, 'is_attribute': False, 'attrname': None, 'is_text': True, 'is_tail': False}

    text-last

{'_parent': <Element span at 0x10140be60>, 'is_attribute': False, 'attrname': None, 'is_text': False, 'is_tail': True}
>>>
>>> root.xpath("./p/following-sibling::text()") # following https://stackoverflow.com/a/39832753/1677041
[]

So, how can I get just the text-first/middle/last parts out of it?

Answer 1

etree is fully capable of doing this:

from lxml import etree

root = etree.fromstring(
'''<p>
    text-first
    <span>
        Child 1
    </span>
    text-middle
    <span>
        Child 2
    </span>
    text-last
</p>''')

print(
    root.text,
    root[0].tail,
    root[1].tail,
)

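Every element behaves like a list of its child elements, so the indices here refer to the two <span> elements. The tail attribute of any element contains the text that directly follows that element, up to the next element or the parent's closing tag.

It will of course include the newlines, so you may want to strip() the result: root.text.strip()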
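
The indexing above can be generalized to any number of children. The following is only a minimal sketch: the helper name direct_text is hypothetical and not part of lxml, and it assumes that whitespace-only fragments should be dropped.

```python
from lxml import etree

root = etree.fromstring(
'''<p>
    text-first
    <span>
        Child 1
    </span>
    text-middle
    <span>
        Child 2
    </span>
    text-last
</p>''')


def direct_text(element):
    """Return the stripped text fragments that belong directly to `element`.

    Hypothetical helper for illustration; not an lxml API.
    """
    # element.text is the text before the first child; each child's tail is
    # the text between that child's end tag and the next sibling (or the end).
    fragments = [element.text] + [child.tail for child in element]
    # text and tail are None when there is no text, so skip empty fragments.
    return [f.strip() for f in fragments if f and f.strip()]


parts = direct_text(root)
print(parts)  # ['text-first', 'text-middle', 'text-last']
```

Iterating over the element rather than hard-coding root[0] and root[1] keeps the code working if more <span> children are added later.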

Answer 2

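Your initial guess, //text(), means: select all text nodes, no matter where they are in the document. What you actually want is to select text nodes only if they are direct children of p or, equivalently for this document, only if they are not children of span.

Given the input document you show, the most accurate answer is /p/text():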

>>> root = etree.fromstring(
'''<p>
    text-first
    <span>
        Child 1
    </span>
    text-middle
    <span>
        Child 2
    </span>
    text-last
</p>''')

>>> etree.XPath("/p/text()")(root)
['\n    text-first\n    ', '\n    text-middle\n    ', '\n    text-last\n']

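Your own solution, child::text(), means: select text nodes if they are children of the current context node. It works because the XPath expression is evaluated with the root element p as the context in this case. That is also why a plain text() works.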

>>> etree.XPath("text()")(root)
['\n    text-first\n    ', '\n    text-middle\n    ', '\n    text-last\n']
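
The difference between these expressions is easier to see side by side. The sketch below uses a shortened document without whitespace, which is an assumption made for readability and not the input from the question. It also includes //text()[not(parent::span)], one possible way to write the "not children of span" variant mentioned above.

```python
from lxml import etree

# Shortened stand-in for the document in the question (assumption).
root = etree.fromstring("<p>a<span>b</span>c<span>d</span>e</p>")

# Every text node in the document, including those inside <span>.
all_text = root.xpath("//text()")

# Text nodes that are children of the context node (here: p).
relative = root.xpath("text()")
explicit_axis = root.xpath("child::text()")

# Absolute path: text nodes that are children of the top-level p.
absolute = root.xpath("/p/text()")

# All text nodes except those whose parent is a span.
not_in_span = root.xpath("//text()[not(parent::span)]")

print(all_text)     # ['a', 'b', 'c', 'd', 'e']
print(relative)     # ['a', 'c', 'e']
print(not_in_span)  # ['a', 'c', 'e']
```

The relative forms depend on which element the expression is evaluated against, while /p/text() depends on p being the document's root element.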

Answer 3

My bad, XPath saved me.

>>> root.xpath('child::text()')
['\n    text-first\n    ', '\n    text-middle\n    ', '\n    text-last\n']

Answer 4

print(root.xpath('normalize-space(//*)'))
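
This answer comes without an explanation, so the following sketch shows what the expression appears to do on the question's document. As far as XPath 1.0 semantics go, normalize-space() applied to a node-set uses the string value of the first node in document order (here p), which includes the text inside the <span> children. If only the direct text of p is wanted, whitespace can instead be collapsed per text node in Python; that second variant is an assumption about the intent, not part of the original answer.

```python
from lxml import etree

root = etree.fromstring(
'''<p>
    text-first
    <span>
        Child 1
    </span>
    text-middle
    <span>
        Child 2
    </span>
    text-last
</p>''')

# String value of the first matched element (p), whitespace collapsed.
# Note that this includes the text of the <span> children.
whole = root.xpath('normalize-space(//*)')
print(whole)  # text-first Child 1 text-middle Child 2 text-last

# Only the text nodes directly under p, each with whitespace collapsed.
direct = [' '.join(t.split()) for t in root.xpath('text()')]
print(direct)  # ['text-first', 'text-middle', 'text-last']
```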

Solution 1
1 Accepted 2019-10-11 07:24:36

Solution 2
1 2019-10-11 11:01:02

Solution 3
0 2019-10-11 07:14:59

Solution 4
0 2019-10-11 07:18:10
