
Get all text inside a tag in lxml

I'd like to write a code snippet that grabs all of the text inside the <content> tag, in lxml, in all three instances below, including the code tags. I've tried tostring(getchildren()), but that misses the text in between the tags. I didn't have much luck searching the API for a relevant function. Could you help me out?

<!--1-->
<content>
<div>Text inside tag</div>
</content>
#should return "<div>Text inside tag</div>"

<!--2-->
<content>
Text with no tag
</content>
#should return "Text with no tag"


<!--3-->
<content>
Text outside tag <div>Text inside tag</div>
</content>
#should return "Text outside tag <div>Text inside tag</div>"

Does text_content() do what you need?

Just use the node.itertext() method, like so:

 ''.join(node.itertext())
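To make the behaviour of itertext() concrete, here is a minimal sketch run against the three cases from the question (whitespace trimmed; the samples list and the all_text helper are names made up for this illustration). It shows that itertext() yields only the text, so the inner <div> tags are not part of the result:

```python
from lxml import etree

# The three cases from the question, without the surrounding newlines.
samples = [
    "<content><div>Text inside tag</div></content>",
    "<content>Text with no tag</content>",
    "<content>Text outside tag <div>Text inside tag</div></content>",
]

def all_text(xml):
    """Concatenate every text node below the root element (tags are dropped)."""
    node = etree.fromstring(xml)
    return "".join(node.itertext())

results = [all_text(s) for s in samples]
for r in results:
    print(repr(r))
# 'Text inside tag'
# 'Text with no tag'
# 'Text outside tag Text inside tag'
```

If the markup of the children has to be kept, as the question asks, one of the tostring()-based answers is the better fit.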

Try:

def stringify_children(node):
    from lxml.etree import tostring
    # tostring(c) already serializes c's own text and its tail, so adding
    # c.text and c.tail separately would duplicate them;
    # encoding='unicode' makes tostring() return str rather than bytes
    parts = ([node.text] +
            [tostring(c, encoding='unicode') for c in node] +
            [node.tail])
    # filter removes possible Nones in texts and tails
    return ''.join(filter(None, parts))

Example:

from lxml import etree
node = etree.fromstring("""<content>
Text outside tag <div>Text <em>inside</em> tag</div>
</content>""")
stringify_children(node)

Produces: '\nText outside tag <div>Text <em>inside</em> tag</div>\n'

A version of albertov's stringify_children that fixes the bugs reported by hoju:

def stringify_children(node):
    from lxml.etree import tostring
    from itertools import chain
    return ''.join(
        chunk for chunk in chain(
            (node.text,),
            # encoding='unicode' makes tostring() return str rather than bytes
            chain(*((tostring(child, encoding='unicode', with_tail=False), child.tail) for child in node)),
            (node.tail,)) if chunk)

The following snippet, which uses a Python generator, works well and is very efficient.

''.join(node.itertext()).strip()

Defining stringify_children this way may be less complicated:

from lxml import etree

def stringify_children(node):
    s = node.text
    if s is None:
        s = ''
    for child in node:
        s += etree.tostring(child, encoding='unicode')
    return s

or in one line:

return (node.text if node.text is not None else '') + ''.join((etree.tostring(child, encoding='unicode') for child in node))

Rationale is the same as in this answer: leave the serialization of child nodes to lxml. The tail of node isn't interesting in this case, since it comes after the end tag. Note that the encoding argument may be changed according to one's needs.

Another possible solution is to serialize the node itself and then strip the start and end tags away:

def stringify_children(node):
    s = etree.tostring(node, encoding='unicode', with_tail=False)
    return s[s.index(node.tag) + 1 + len(node.tag): s.rindex(node.tag) - 2]

which is somewhat horrible. This code is correct only if node has no attributes, and I don't think anyone would want to use it even then.

from urllib.request import urlopen
from lxml import etree
url = 'some_url'

Fetching the URL:

test = urlopen(url)
page = test.read()

Parsing the page into an HTML tree:

tree = etree.HTML(page)

XPath selector (xpath() returns a list, so take the first match):

table = tree.xpath("xpath_here")[0]
res = etree.tostring(table)

res is the HTML code of the table, including the table tag itself. This did the job for me.

So you can extract a tag's text with an XPath query, and a tag together with its content with tostring():

div = tree.xpath("//div")[0]
div_res = etree.tostring(div)
text = tree.xpath("string(//content)")

or text = tree.xpath("//content/text()")

div_3 = tree.xpath("//content")[0]
div_3_res = etree.tostring(div_3, encoding='unicode', with_tail=False)
div_3_res = div_3_res[len('<content>'):-len('</content>')]

Slicing the tags off in that last line is not nice, and it only holds while <content> has no attributes, but it works.
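The XPath text queries differ in what they return. A small sketch, assuming an XML-parsed <content> element instead of a fetched page (root, direct, nested and joined are names chosen for this example):

```python
from lxml import etree

root = etree.fromstring(
    "<content>Text outside tag <div>Text inside tag</div></content>"
)

# text() with a single slash: only text nodes that are direct children.
direct = root.xpath("//content/text()")
# text() with a double slash: text nodes at any depth, as a list.
nested = root.xpath("//content//text()")
# string(): the concatenated text of the first matching element.
joined = root.xpath("string(//content)")

print(direct)  # ['Text outside tag ']
print(nested)  # ['Text outside tag ', 'Text inside tag']
print(joined)  # Text outside tag Text inside tag
```

None of these keep the inner tags; for that, tostring() on the element is still needed.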

One of the simplest code snippets that actually worked for me, per the documentation at http://lxml.de/tutorial.html#using-xpath-to-find-text, is:

etree.tostring(html, method="text")

where html is the node/tag whose complete text you are trying to read. Beware that it doesn't get rid of script and style tags, though.
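As a quick illustration of the method="text" serialization, here is a sketch on the markup from an earlier example in this thread. Passing encoding="unicode" is an addition of this example; without it tostring() returns bytes:

```python
from lxml import etree

node = etree.fromstring(
    "<content>Text outside tag <div>Text <em>inside</em> tag</div></content>"
)

# Default: an encoded byte string.
as_bytes = etree.tostring(node, method="text")
# With encoding="unicode": a regular str.
as_text = etree.tostring(node, method="text", encoding="unicode")

print(as_bytes)  # b'Text outside tag Text inside tag'
print(as_text)   # Text outside tag Text inside tag
```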

In response to @Richard's comment above, if you patch stringify_children to read:

 parts = ([node.text] +
--            list(chain(*([c.text, tostring(c), c.tail] for c in node.getchildren()))) +
++            list(chain(*([tostring(c)] for c in node.getchildren()))) +
           [node.tail])

it seems to avoid the duplication he refers to, since tostring(c) already serializes c.text and c.tail.

Just a quick enhancement, as the answer has already been given. If you want to clean the inner text:

clean_string = ' '.join([n.strip() for n in node.itertext()]).strip()
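A short sketch of what this cleaning does, together with one caveat: whitespace-only text nodes become empty strings, which can leave double spaces in the result. The split()-based variant below is an alternative suggested here, not part of the answer above, and the sample documents are made up:

```python
from lxml import etree

def clean_strip(node):
    """The approach from the answer: strip each piece, join with spaces."""
    return " ".join(n.strip() for n in node.itertext()).strip()

def clean_split(node):
    """Alternative: join everything, then collapse all whitespace runs."""
    return " ".join("".join(node.itertext()).split())

a = etree.fromstring(
    "<content>\nText outside tag <div>Text <em>inside</em> tag</div>\n</content>"
)
b = etree.fromstring("<content>\n  <div>a</div>\n  <div>b</div>\n</content>")

print(repr(clean_strip(a)))  # 'Text outside tag Text inside tag'
print(repr(clean_strip(b)))  # 'a  b'  (two spaces, from the empty piece)
print(repr(clean_split(b)))  # 'a b'
```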

I know that this is an old question, but it is a common problem, and I have a solution that seems simpler than the ones suggested so far:

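import lxml.html

def stringify_children(node):
    """Given an lxml tag, return its contents as a string

       >>> html = "<p><strong>Sample sentence</strong> with tags.</p>"
       >>> node = lxml.html.fragment_fromstring(html)
       >>> stringify_children(node)
       '<strong>Sample sentence</strong> with tags.'
    """
    if node is None or (len(node) == 0 and not getattr(node, 'text', None)):
        return ""
    # note: this removes the attributes from the node itself, so that the
    # opening tag has a known length
    node.attrib.clear()
    opening_tag = len(node.tag) + 2     # length of "<tag>"
    closing_tag = -(len(node.tag) + 3)  # length of "</tag>"
    # encoding='unicode' returns str rather than bytes; with_tail=False keeps
    # any text after the closing tag out of the slice
    return lxml.html.tostring(node, encoding='unicode', with_tail=False)[opening_tag:closing_tag]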

Unlike some of the other answers to this question, this solution preserves all of the tags contained within the node, and it attacks the problem from a different angle than the other working solutions.

lxml has a method for that (available on lxml.html elements):

node.text_content()
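A minimal sketch of text_content(), assuming the document is parsed with lxml.html. The sample markup uses <div> and <p> rather than the question's <content> tag, purely to keep the example inside standard HTML:

```python
import lxml.html
from lxml import etree

html_node = lxml.html.fragment_fromstring(
    "<div>Text outside tag <p>Text inside tag</p></div>"
)
# text_content() returns the text of the element and all its descendants.
text = html_node.text_content()
print(text)  # Text outside tag Text inside tag

# Elements created by plain lxml.etree do not provide this method.
xml_node = etree.fromstring("<content>Text with no tag</content>")
print(hasattr(xml_node, "text_content"))  # False
```

Like itertext(), this drops the inner tags and keeps only the text.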

Here is a working solution. We can serialize the content together with its parent tag and then cut the parent tag from the output.

import re
from lxml import etree

def _tostr_with_tags(parent_element, html_entities=False):
    RE_CUT = r'^<([\w-]+)>(.*)</([\w-]+)>$'
    # tostring() returns ASCII bytes in which non-ASCII characters appear as
    # numeric character references (&#NNN;); with_tail=False leaves out any
    # text that follows the end tag
    content_with_parent = etree.tostring(parent_element, with_tail=False).decode('ascii')

    def _replace_html_entities(s):
        RE_ENTITY = r'&#(\d+);'

        def repl(m):
            return chr(int(m.group(1)))

        replaced = re.sub(RE_ENTITY, repl, s, flags=re.MULTILINE|re.UNICODE)

        return replaced

    if not html_entities:
        content_with_parent = _replace_html_entities(content_with_parent)

    content_with_parent = content_with_parent.strip() # remove whitespace at both ends

    start_tag, content_without_parent, end_tag = re.findall(RE_CUT, content_with_parent, flags=re.UNICODE|re.MULTILINE|re.DOTALL)[0]

    if start_tag != end_tag:
        raise Exception('Start tag does not match end tag while getting content with tags.')

    return content_without_parent

parent_element must be of type Element.

Please note that if you want text content (rather than HTML entities in the text), leave the html_entities parameter set to False.

If this is a tag, you can try:

node.values()

import re
from lxml import etree

node = etree.fromstring("""
<content>Text before inner tag
    <div>Text
        <em>inside</em>
        tag
    </div>
    Text after inner tag
</content>""")

print(re.search(r"\A<[^<>]*>(.*)</[^<>]*>\Z", etree.tostring(node, encoding='unicode'), re.DOTALL).group(1))
