How do I get the full XML or HTML content of an element using ElementTree?

Question

That is, all text and subtags, without the tag of the element itself?

Having

<p>blah <b>bleh</b> blih</p>

I want

blah <b>bleh</b> blih

element.text returns "blah " and etree.tostring(element) returns:

<p>blah <b>bleh</b> blih</p>

Answer 1

ElementTree works perfectly; you have to assemble the answer yourself. Something like this...

"".join([t.text or ""] + [xml.tostring(e, encoding="unicode") for e in t])

Thanks to JV and PEZ for pointing out the errors.

Edit.

>>> import xml.etree.ElementTree as xml
>>> s= '<p>blah <b>bleh</b> blih</p>\n'
>>> t=xml.fromstring(s)
>>> "".join([t.text or ""] + [xml.tostring(e, encoding="unicode") for e in t])
'blah <b>bleh</b> blih'
>>>

The tail is not needed: tostring() already emits each child's tail text along with the child.
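The same idea can be wrapped in a small function for Python 3. This is only a minimal sketch: the name inner_xml is made up for illustration and is not part of ElementTree.

```python
import xml.etree.ElementTree as ET


def inner_xml(element):
    """Return the text and serialized children of element, without its own tag."""
    parts = [element.text or ""]
    for child in element:
        # tostring() serializes the child together with its tail text,
        # so the tail does not have to be appended separately.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


root = ET.fromstring("<p>blah <b>bleh</b> blih</p>")
print(inner_xml(root))  # blah <b>bleh</b> blih
```

Note that element.text comes back unescaped, so a leading text such as "a &amp; b" would appear as "a & b" in the result.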

Answer 2

This is the solution I ended up using:

def element_to_string(element):
    s = element.text or ""
    for sub_element in element:
        s += etree.tostring(sub_element)
    s += element.tail or ""
    return s

Answer 3

These are good answers, which answer the OP's question, particularly if the question is confined to HTML. But documents are inherently messy, and the depth of element nesting is usually impossible to predict.

To simulate DOM's getTextContent() you would have to use a (very) simple recursive mechanism.

To get just the bare text:

def get_deep_text( element ):
    text = element.text or ''
    for subelement in element:
        text += get_deep_text( subelement )
    text += element.tail or ''
    return text
print( get_deep_text( element_of_interest ))
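For the bare-text case, the standard library's Element.itertext() does a similar walk, yielding each text and inner tail in document order. The sketch below compares the two on a made-up fragment; the one difference is that get_deep_text also appends the tail of the starting element, while itertext() leaves it out.

```python
import xml.etree.ElementTree as ET


def get_deep_text(element):
    text = element.text or ""
    for subelement in element:
        text += get_deep_text(subelement)
    text += element.tail or ""
    return text


# Hypothetical sample: the <p> element has the tail " after".
root = ET.fromstring("<div><p>blah <b>bleh</b> blih</p> after</div>")
p = root[0]

print(repr(get_deep_text(p)))       # 'blah bleh blih after'
print(repr("".join(p.itertext())))  # 'blah bleh blih'
```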

To get all the details about the boundaries between raw text:

import itertools

# A module-level counter numbers the elements in document order. Storing the
# count as an ad hoc attribute on the Element may fail, since the C-accelerated
# ElementTree does not accept arbitrary attributes on Element objects.
element_counter = itertools.count( 1 )

def get_deep_text_w_boundaries( element, depth = 0 ):
    element_no = next( element_counter )
    indent = depth * '  '
    text1 = '%s(el %d - attribs: %s)\n' % ( indent, element_no, element.attrib, )
    text1 += '%s(el %d - text: |%s|)' % ( indent, element_no, element.text or '', )
    print( text1 )
    for subelement in element:
        get_deep_text_w_boundaries( subelement, depth + 1 )
    text2 = '%s(el %d - tail: |%s|)' % ( indent, element_no, element.tail or '', )
    print( text2 )

get_deep_text_w_boundaries( root_el_of_interest )

Example output from a single paragraph in a LibreOffice Writer document (.fodt file):

(el 1 - attribs: {'{urn:oasis:names:tc:opendocument:xmlns:text:1.0}style-name': 'Standard'})
(el 1 - text: |Ci-après individuellement la "|)
  (el 2 - attribs: {'{urn:oasis:names:tc:opendocument:xmlns:text:1.0}style-name': 'T5'})
  (el 2 - text: |Partie|)
  (el 2 - tail: |" et ensemble les "|)
  (el 3 - attribs: {'{urn:oasis:names:tc:opendocument:xmlns:text:1.0}style-name': 'T5'})
  (el 3 - text: |Parties|)
  (el 3 - tail: |", |)
(el 1 - tail: |
   |)

One of the points about messiness is that there is no hard and fast rule about when a text style indicates a word boundary and when it doesn't: superscript immediately following a word (with no white space) means a separate word in all use cases I can imagine. On the other hand, sometimes you might find, for example, a document where the first letter is bolded for some reason, or perhaps uses a different style for the first letter to represent it as upper case, rather than simply using the normal upper-case character.

And of course, the less English-centric this discussion gets, the greater the subtleties and complexities!

Answer 4

I doubt ElementTree is the thing to use for this. But assuming you have strong reasons for using it, maybe you could try stripping the root tag from the fragment:

 re.sub(r'(^<%s\b.*?>|</%s\b.*?>$)' % (element.tag, element.tag), '', ElementTree.tostring(element))
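A runnable variant of this idea for Python 3 might look like the sketch below. The function name strip_outer_tag is invented for illustration, and the pattern is adjusted under two assumptions: the tag has no namespace (otherwise the serialized prefix would not match element.tag), and any tail text after the closing tag should be discarded along with it.

```python
import re
import xml.etree.ElementTree as ET


def strip_outer_tag(element):
    """Serialize element, then strip its own start and end tags with a regex."""
    tag = re.escape(element.tag)
    xml_text = ET.tostring(element, encoding="unicode")
    # First alternative: the element's own start tag at the very beginning.
    # Second alternative: its end tag at the very end, plus any tail text
    # (serialized tail text never contains a literal "<").
    pattern = r"^<%s\b[^>]*>|</%s\s*>[^<]*\Z" % (tag, tag)
    return re.sub(pattern, "", xml_text)


root = ET.fromstring("<p>blah <b>bleh</b> blih</p>")
print(strip_outer_tag(root))  # blah <b>bleh</b> blih
```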

Answer 5

Most of the answers here are based on the XML parser ElementTree; even PEZ's regex-based answer still partially relies on ElementTree.

All of those are good and suitable for most use cases but, for the sake of completeness, it is worth noting that ElementTree.tostring(...) will give you an equivalent snippet, but not always one identical to the original payload. If, for some very rare reason, you want to extract the content as-is, you have to use a pure regex-based solution. This example shows how I use a regex-based solution.
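The example referred to is not reproduced in this text. As an illustration only, and not the answer author's code, here is a sketch of the difference on a made-up fragment: the ElementTree round trip normalizes quoting and empty-element syntax, while a regex over the raw string returns the inner content unchanged. The pattern assumes a single, non-nested p element.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical input with single-quoted attributes and a compact <br/>.
raw = "<p>blah <b class='x'>bleh</b> blih<br/></p>"

# Round trip through ElementTree: equivalent markup, but re-serialized.
root = ET.fromstring(raw)
round_trip = (root.text or "") + "".join(
    ET.tostring(child, encoding="unicode") for child in root
)
print(round_trip)  # blah <b class="x">bleh</b> blih<br />

# Pure regex on the raw string: the inner content comes back as written.
match = re.search(r"<p\b[^>]*>(.*)</p\s*>", raw, flags=re.DOTALL)
as_is = match.group(1) if match else None
print(as_is)  # blah <b class='x'>bleh</b> blih<br/>
```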

Answer 6

This answer is a slightly modified version of Pupeno's reply. Here I added the encoding type to "tostring". This issue cost me many hours. I hope this small correction will help others.

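from xml.etree import ElementTree

def element_to_string(element):
    s = element.text or ""
    for sub_element in element:
        # encoding='unicode' makes tostring() return str instead of bytes.
        s += ElementTree.tostring(sub_element, encoding='unicode')
    # The tail is None when nothing follows the element.
    s += element.tail or ""
    return s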
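Why the encoding argument matters on Python 3 can be seen in a few lines. This is a small sketch using the fragment from the question: without the argument, tostring() returns bytes, which cannot be concatenated with the str in element.text.

```python
import xml.etree.ElementTree as ET

b = ET.fromstring("<p>blah <b>bleh</b> blih</p>")[0]

as_bytes = ET.tostring(b)                     # default encoding: returns bytes
as_text = ET.tostring(b, encoding="unicode")  # returns str

print(type(as_bytes).__name__, type(as_text).__name__)  # bytes str

try:
    "blah " + as_bytes
    concat_failed = False
except TypeError:
    # str and bytes cannot be concatenated on Python 3.
    concat_failed = True

print("blah " + as_text)  # blah <b>bleh</b> blih
```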

Answer 7

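I don't know whether an external library is an option, but anyway: assuming there is a single <p> with this text on the page, the jQuery solution would be: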

alert($('p').html()); // returns blah <b>bleh</b> blih

7 answers

Answer 1
11 2008-12-19 11:21:52

Answer 2
8 accepted 2008-12-19 17:27:09

Answer 3
3 2015-12-04 09:29:26

Answer 4
2 2008-12-19 11:56:30

Answer 5
1 2018-02-21 01:32:18

Answer 6
0 2020-07-21 00:06:21

Answer 7
-4 2008-12-19 11:23:59
