
Using lxml to tag parts of a text

I am working with XML using the Python lxml library.

I have a paragraph of text like so:

<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer facilisis elit eget
condimentum efficitur. Donec eu dignissim lectus. Integer tortor
lacus, porttitor at ipsum quis, tempus dignissim dui. Curabitur cursus
quis arcu in pellentesque. Aenean volutpat, tortor a commodo interdum,
lorem est convallis dui, sodales imperdiet ligula ligula non felis.</p>

Suppose I want to tag a specific bit of text that exists inside the paragraph above, such as "tortor lacus, porttitor at ipsum quis, tempus", with the tag <foobar>. How would I go about doing this with lxml? Right now I'm using text replace, but I feel that isn't the right way to go about this.

That is, the result I am looking for would be:

<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer facilisis elit eget
condimentum efficitur. Donec eu dignissim lectus. Integer <foobar>tortor
lacus, porttitor at ipsum quis, tempus</foobar> dignissim dui. Curabitur cursus 
quis arcu in pellentesque. Aenean volutpat, tortor a commodo interdum,
lorem est convallis dui, sodales imperdiet ligula ligula non felis.</p>

Replacing text with an actual element is tricky in lxml, especially if you have mixed content (a mix of text and child elements).

The tricky part is knowing what to do with the remaining text and where to insert the element. Should the remaining text be part of the parent's .text? Should it be part of the .tail of the preceding sibling? Should it be part of the new element's .tail?
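To make the .text/.tail question concrete, here is a minimal sketch of where lxml stores each piece of text in mixed content. The sample paragraph and the <b> element are made up for illustration.

```python
from lxml import etree

p = etree.fromstring(
    "<p>Lorem <gotcha>sit amet</gotcha>, consectetur <b>elit</b>. Integer</p>"
)

# Text before the first child belongs to the parent's .text.
print(repr(p.text))

# Text that follows a child's end tag is stored on that child's .tail,
# not on the parent.
for child in p:
    print(child.tag, repr(child.text), repr(child.tail))
```

So a phrase you want to wrap may sit in the parent's .text or in the .tail of any child, and the new element has to be inserted at the matching position.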

What I've done in the past is to process all of the text() nodes and add placeholder strings to the text (whether that's .text or .tail). I then serialize the tree to a string and do a search and replace on the placeholders. After that I either parse the string as XML to build a new tree (for further processing, validation, analysis, etc.) or write it to a file.

Please see my related question/answer for additional info on .text/.tail in this context.

Here's an example based on my answer in the question above.

Notes:

  • I added gotcha elements to show how it handles mixed content.
  • I added a second search string (Aenean volutpat) to show replacing more than one string.
  • In this example, I'm only processing text() nodes that are children of p.

Python

import re
from lxml import etree

xml = """<doc>
<p>Lorem ipsum dolor <gotcha>sit amet</gotcha>, consectetur adipiscing elit. Integer facilisis elit eget
condimentum efficitur. Donec eu dignissim lectus. Integer tortor
lacus, porttitor at ipsum quis, tempus dignissim dui. Curabitur cursus
quis arcu <gotcha>in pellentesque</gotcha>. Aenean volutpat, tortor a commodo interdum,
lorem est convallis dui, sodales imperdiet ligula ligula non felis.</p>
</doc>
"""


def update_text(orig_text, phrase_list, elemname):
    new_text = orig_text
    for phrase in phrase_list:
        # Add placeholders for the new start/end tags.
        # str.replace() leaves the text unchanged when the phrase is absent.
        new_text = new_text.replace(phrase, f"[elemstart:{elemname}]{phrase}[elemend:{elemname}]")
    return new_text


root = etree.fromstring(xml)

foobar_phrases = {"tortor lacus, porttitor at ipsum quis, tempus", "Aenean volutpat"}

for text in root.xpath("//p/text()"):
    parent = text.getparent()
    updated_text = update_text(text.replace("\n", " "), foobar_phrases, "foobar")
    if text.is_text:
        parent.text = updated_text
    elif text.is_tail:
        parent.tail = updated_text

# Serialize the tree to a string so we can replace the placeholders with proper tags.
serialized_tree = etree.tostring(root, encoding="utf-8").decode()
serialized_tree = re.sub(r"\[elemstart:([^\]]+)\]", r"<\1>", serialized_tree)
serialized_tree = re.sub(r"\[elemend:([^\]]+)\]", r"</\1>", serialized_tree)

# Now we can either parse the string back into a tree (for additional processing, validation, etc.),
# print it, write it to a file, etc.
print(serialized_tree)

Printed Output (line breaks added for readability)

<doc>
<p>Lorem ipsum dolor <gotcha>sit amet</gotcha>, consectetur adipiscing elit. 
Integer facilisis elit eget condimentum efficitur. Donec eu dignissim lectus.
Integer <foobar>tortor lacus, porttitor at ipsum quis, tempus</foobar> dignissim dui.
Curabitur cursus quis arcu <gotcha>in pellentesque</gotcha>. <foobar>Aenean volutpat</foobar>, 
tortor a commodo interdum, lorem est convallis dui, sodales imperdiet ligula ligula non felis.</p>
</doc>
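If you need a tree again rather than a string, the result can be parsed back. The following is a minimal sketch that uses a shortened, hard-coded version of the output above (the shortened string is an assumption made to keep the example self-contained) and checks that the new elements are real elements.

```python
from lxml import etree

# Shortened stand-in for the serialized tree printed above.
serialized_tree = (
    "<doc><p>Integer <foobar>tortor lacus</foobar> dignissim "
    "<gotcha>in pellentesque</gotcha>. <foobar>Aenean volutpat</foobar>, tortor</p></doc>"
)

# Parsing fails with an XMLSyntaxError if the placeholder replacement
# produced markup that is not well-formed.
new_root = etree.fromstring(serialized_tree)

# The placeholders are now real elements that can be queried like any other.
wrapped = [el.text for el in new_root.iter("foobar")]
print(wrapped)
```

Parsing the string back also works as a cheap well-formedness check on the search and replace step.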

You can check whether there are any children like this:

from lxml import etree

root = etree.parse("test.xml").getroot()
paragraphs = root.findall("p")

print(f"Found {len(paragraphs)} paragraphs")

for i, paragraph in enumerate(paragraphs):
    # len() of an element is its number of child elements.
    if len(paragraph) > 0:
        print(f"Paragraph {i} has children")
    else:
        print(f"Paragraph {i} has no children")

First the code finds all paragraphs, and then checks whether each paragraph has children.

If a paragraph has no children, you can just replace the text as before; if it has children, you can replace the whole child.

If a <p> tag won't be nested inside another <p>, you may consider a regex replace:
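For the simple case, the wrapping can also be done directly on the tree instead of through string replacement. This is only a sketch under stated assumptions: the helper name wrap_phrase is hypothetical, it searches only the element's own .text (the text before any first child), and it wraps only the first occurrence.

```python
from lxml import etree


def wrap_phrase(elem, phrase, tagname):
    """Wrap the first occurrence of phrase in elem.text in a new child element.

    Hypothetical helper: text stored in the .tail of child elements is ignored.
    Returns True if the phrase was found and wrapped.
    """
    text = elem.text or ""
    start = text.find(phrase)
    if start == -1:
        return False
    new_elem = etree.Element(tagname)
    new_elem.text = phrase
    # Whatever followed the phrase becomes the tail of the new element.
    new_elem.tail = text[start + len(phrase):]
    # Whatever preceded the phrase stays in the parent's .text.
    elem.text = text[:start]
    # elem.text always precedes the first child, so the new element goes first.
    elem.insert(0, new_elem)
    return True


p = etree.fromstring("<p>Integer tortor lacus, tempus dignissim dui.</p>")
wrap_phrase(p, "tortor lacus, tempus", "foobar")
print(etree.tostring(p, encoding="unicode"))
```

Handling phrases that sit in a child's .tail would need the same split applied to that .tail, with the new element inserted right after that child.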

import re

a="""
other lines here that may contain foo
<p>
this is a foo inside para
and this is new line in this foo para
</p>
excess lines here that also may contain foo in it.
"""

search="foo"
newtagname="bar"

b=re.sub("("+search+")(?=[^><]*?</p>)","<"+newtagname+">\\1</"+newtagname+">",a)

print(b)

This prints:

other lines here that may contain foo
<p>
this is a <bar>foo</bar> inside para
and this is new line in this <bar>foo</bar> para
</p>
excess lines here that also may contain foo in it.
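One caveat with the regex approach: the search string is used as a pattern, so characters such as "." or "(" are treated as regex syntax. A hedged variant that escapes the search string first might look like this; the function name tag_in_paragraphs and the sample input are made up for illustration.

```python
import re


def tag_in_paragraphs(text, search, newtagname):
    # re.escape() makes the search string match literally.
    # The lookahead only accepts matches that are followed by </p>
    # with no other tag in between.
    pattern = "(" + re.escape(search) + ")(?=[^><]*?</p>)"
    return re.sub(pattern, rf"<{newtagname}>\1</{newtagname}>", text)


sample = "a.c <p>a.c and abc</p> a.c"
result = tag_in_paragraphs(sample, "a.c", "bar")
print(result)
```

Like the original, this relies on the paragraph containing no child elements between the match and the closing </p>.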
