Get all text inside a tag in lxml

Question

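I'd like to write a code snippet that grabs all of the text inside the <content> tag in lxml, in all three instances below, including the code tags. I've tried tostring(getchildren()), but that misses the text in between the tags. I didn't have much luck searching the API for a relevant function. Could you help me out?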

<!--1-->
<content>
<div>Text inside tag</div>
</content>
#should return "<div>Text inside tag</div>"

<!--2-->
<content>
Text with no tag
</content>
#should return "Text with no tag"


<!--3-->
<content>
Text outside tag <div>Text inside tag</div>
</content>
#should return "Text outside tag <div>Text inside tag</div>"

Answer 1

Does text_content() do what you need?
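For reference, here is a minimal sketch of what text_content() returns. It assumes the markup is parsed with lxml.html (the method is not available on plain lxml.etree elements), and the div/span sample is made up for illustration. Note that it returns only the text, without the inner tags the question asks for.

```python
import lxml.html

# Hypothetical sample markup, parsed with lxml.html so that
# the resulting element has the text_content() method.
node = lxml.html.fragment_fromstring(
    "<div>Text outside tag <span>Text inside tag</span></div>"
)

# text_content() concatenates all text below the element and drops the tags.
text = node.text_content()
print(text)
```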

Answer 2

Just use the node.itertext() method, like this:

 ''.join(node.itertext())
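A quick sketch of what that one-liner yields on the third case from the question (the whitespace is trimmed here for readability, which is my own simplification). itertext() walks the text and tail of every descendant, so the result is the plain text only and the <div> markup is lost.

```python
from lxml import etree

node = etree.fromstring(
    "<content>Text outside tag <div>Text inside tag</div></content>"
)

# itertext() yields each text chunk in document order; tags are not included.
chunks = list(node.itertext())
joined = ''.join(chunks)
print(chunks)
print(joined)
```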

Answer 3

Try:

def stringify_children(node):
    from lxml.etree import tostring
    from itertools import chain
    parts = ([node.text] +
            list(chain(*([c.text, tostring(c), c.tail] for c in node.getchildren()))) +
            [node.tail])
    # filter removes possible Nones in texts and tails
    return ''.join(filter(None, parts))

Example:

from lxml import etree
node = etree.fromstring("""<content>
Text outside tag <div>Text <em>inside</em> tag</div>
</content>""")
stringify_children(node)

Produces: '\nText outside tag <div>Text <em>inside</em> tag</div>\n'

Answer 4

A version of albertov's stringify-content that fixes the bug reported by hoju:

def stringify_children(node):
    from lxml.etree import tostring
    from itertools import chain
    # encoding='unicode' makes tostring() return str, so every chunk can be joined
    return ''.join(
        chunk for chunk in chain(
            (node.text,),
            chain(*((tostring(child, with_tail=False, encoding='unicode'), child.tail) for child in node)),
            (node.tail,)) if chunk)

Answer 5

The following snippet, which uses a Python generator, works well and is very efficient.

''.join(node.itertext()).strip()

Answer 6

Defining stringify_children this way may be less complicated:

from lxml import etree

def stringify_children(node):
    s = node.text
    if s is None:
        s = ''
    for child in node:
        s += etree.tostring(child, encoding='unicode')
    return s

Or in one line:

return (node.text if node.text is not None else '') + ''.join((etree.tostring(child, encoding='unicode') for child in node))

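The rationale is the same as in this answer: leave the serialization of the child nodes to lxml. The tail of node is not interesting in this case, because it comes "after" the closing tag. Note that the encoding argument may be changed as needed.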

Another possible solution is to serialize the node itself and then strip off the opening and closing tags:

def stringify_children(node):
    s = etree.tostring(node, encoding='unicode', with_tail=False)
    return s[s.index(node.tag) + 1 + len(node.tag): s.rindex(node.tag) - 2]

This is somewhat horrible. The code is correct only if node has no attributes, and I don't think anyone would want to use it even then.
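To make the attribute caveat concrete, here is a sketch that runs the same slicing logic under a hypothetical name, strip_outer_tags, on an element with and without an attribute. The sample markup is made up.

```python
from lxml import etree

def strip_outer_tags(node):
    # Same slicing as above: assumes the opening tag is exactly "<tag>".
    s = etree.tostring(node, encoding="unicode", with_tail=False)
    return s[s.index(node.tag) + 1 + len(node.tag): s.rindex(node.tag) - 2]

plain = strip_outer_tags(etree.fromstring("<content>abc</content>"))
with_attr = strip_outer_tags(etree.fromstring('<content id="x">abc</content>'))

print(repr(plain))
print(repr(with_attr))  # the attribute leaks into the result
```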

Answer 7

from urllib.request import urlopen
from lxml import etree
url = 'some_url'

Fetch the URL

test = urlopen(url)
page = test.read()

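Get all the HTML code, including the table tag

tree = etree.HTML(page)

XPath selector

table = tree.xpath("xpath_here")[0]  # xpath() returns a list of matches
res = etree.tostring(table)

res is the HTML code of the table; this did the job for me.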

So you can use an XPath text() query to extract the text inside a tag, and tostring() to extract a tag together with its content

div = tree.xpath("//div")[0]
div_res = etree.tostring(div)

text = tree.xpath("//content/text()")

or text = tree.xpath("string(//content)")

div_3 = tree.xpath("//content")[0]
div_3_res = etree.tostring(div_3, encoding='unicode').strip('<content>').rstrip('</')

The last line, with the strip method, is not nice, but it just works

Answer 8

One of the simplest code snippets, which actually worked for me and which follows the documentation at http://lxml.de/tutorial.html#using-xpath-to-find-text, is

etree.tostring(html, method="text")

where html is the node/tag whose complete text you are trying to read. Note that it does not get rid of script and style tags.
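A small sketch of that behaviour, with made-up markup: method="text" serializes only the text nodes, and the contents of a script element are text like any other, so they show up in the result.

```python
from lxml import etree

node = etree.fromstring(
    "<content>Text outside tag <script>var x = 1;</script>"
    "<div>Text inside tag</div></content>"
)

# encoding="unicode" returns a str instead of bytes.
text = etree.tostring(node, method="text", encoding="unicode")
print(text)
```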

Answer 9

In response to @Richard's comment above, if you patch stringify_children to read:

 parts = ([node.text] +
--            list(chain(*([c.text, tostring(c), c.tail] for c in node.getchildren()))) +
++            list(chain(*([tostring(c)] for c in node.getchildren()))) +
           [node.tail])

it seems to avoid the duplication he mentioned.

Answer 10

Just a quick enhancement, since the answer has already been given. If you want to clean up the inner text:

clean_string = ' '.join([n.strip() for n in node.itertext()]).strip()
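Here is a sketch of the effect on the multi-line sample used in an earlier answer. Each chunk from itertext() is stripped before joining, which removes the newlines and indentation that come from the source layout.

```python
from lxml import etree

node = etree.fromstring("""<content>
Text outside tag <div>Text <em>inside</em> tag</div>
</content>""")

raw = ''.join(node.itertext())
clean_string = ' '.join(n.strip() for n in node.itertext()).strip()

print(repr(raw))
print(repr(clean_string))
```

One thing to keep in mind: chunks that contain only whitespace become empty strings, so markup with whitespace between adjacent tags can leave double spaces in the middle of the result.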

Answer 11

I know this is an old question, but it is a common problem, and I have a solution that seems simpler than the ones suggested so far:

import lxml.html

def stringify_children(node):
    """Given an lxml tag, return its contents as a string

       >>> html = "<p><strong>Sample sentence</strong> with tags.</p>"
       >>> node = lxml.html.fragment_fromstring(html)
       >>> stringify_children(node)
       '<strong>Sample sentence</strong> with tags.'
    """
    if node is None or (len(node) == 0 and not getattr(node, 'text', None)):
        return ""
    node.attrib.clear()  # attributes would change the length of the opening tag
    opening_tag = len(node.tag) + 2
    closing_tag = -(len(node.tag) + 3)
    # encoding='unicode' returns str rather than bytes
    return lxml.html.tostring(node, encoding='unicode')[opening_tag:closing_tag]

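Unlike some of the other answers to this question, this solution preserves all of the tags contained within the node, and it attacks the problem from a different angle than the other working solutions.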

Answer 12

lxml.html elements have a method for this:

node.text_content()

Answer 13

Here is a working solution. We can get the content together with the parent tag and then cut the parent tag out of the output.

import re
from lxml import etree

def _tostr_with_tags(parent_element, html_entities=False):
    RE_CUT = r'^<([\w-]+)>(.*)</([\w-]+)>$'
    # tostring() returns ASCII bytes by default; decode so the str regexes apply
    content_with_parent = etree.tostring(parent_element).decode('ascii')

    def _replace_html_entities(s):
        RE_ENTITY = r'&#(\d+);'

        def repl(m):
            return chr(int(m.group(1)))

        replaced = re.sub(RE_ENTITY, repl, s, flags=re.MULTILINE|re.UNICODE)

        return replaced

    if not html_entities:
        content_with_parent = _replace_html_entities(content_with_parent)

    content_with_parent = content_with_parent.strip() # remove 'white' characters on margins

    start_tag, content_without_parent, end_tag = re.findall(RE_CUT, content_with_parent, flags=re.UNICODE|re.MULTILINE|re.DOTALL)[0]

    if start_tag != end_tag:
        raise Exception('Start tag does not match to end tag while getting content with tags.')

    return content_without_parent

parent_element must be of type Element.

Note that if you want the text content (not HTML entities in the text), leave the html_entities parameter as False.
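For context on where those entities come from, here is a small sketch with made-up text: with its default ASCII output, tostring() writes non-ASCII characters as numeric character references, while encoding="unicode" keeps the characters as they are.

```python
from lxml import etree

# \u00e9 and \u00ef are written as escapes to keep this source ASCII-only.
node = etree.fromstring("<content>caf\u00e9 <div>na\u00efve</div></content>")

ascii_bytes = etree.tostring(node)                       # default: ASCII bytes
unicode_text = etree.tostring(node, encoding="unicode")  # str, no references

print(ascii_bytes)
print(unicode_text)
```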

Answer 14

If this is a tag, you can try:

node.values()
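For what it's worth, a quick sketch of what values() gives back (the link markup and URL are made up): it returns the element's attribute values, not its text.

```python
from lxml import etree

node = etree.fromstring('<a href="https://example.com" class="link">Example</a>')

attribute_values = node.values()  # attribute values, in document order
inner_text = node.text            # the text inside the tag

print(attribute_values)
print(inner_text)
```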

Answer 15

import re
from lxml import etree

node = etree.fromstring("""
<content>Text before inner tag
    <div>Text
        <em>inside</em>
        tag
    </div>
    Text after inner tag
</content>""")

print(re.search(r"\A<[^<>]*>(.*)</[^<>]*>\Z", etree.tostring(node, encoding="unicode"), re.DOTALL).group(1))

15 answers

Answer 1
80 2012-08-15 03:14:52

Answer 2
78 2013-02-25 19:00:23

Answer 3
45 accepted 2011-01-07 09:35:28

Answer 4
21 2015-01-27 15:23:11

Answer 5
20 2016-06-27 11:08:55

Answer 6
6 2014-06-10 22:26:12

Answer 7
5 2012-08-19 01:14:58

Answer 8
4 2017-07-05 06:53:42

Answer 9
2 2013-04-30 16:18:44

Answer 10
2 2020-04-06 02:12:32

Answer 11
1 2015-09-08 22:22:27

Answer 12
1 2017-10-08 08:36:10

Answer 13
0 2017-08-18 17:12:56

Answer 14
-2 2012-11-14 16:30:55

Answer 15
-2 2015-01-08 00:59:19
