How to replace a list of strings in a text where some of them are substrings of others in Python?

I have a text containing some words that I would like to tag, and the words to be tagged are contained in a list. The problem is that some of those words are substrings of others, but I want to tag the longest recognized string from the list.

For example, if my text is "foo and bar are different from foo bar." and my list contains "foo", "bar" and "foo bar", the result should be "<tag>foo</tag> and <tag>bar</tag> are different from <tag>foo bar</tag>."

text = "foo and bar are different from foo bar."
words = ["foo", "bar", "foo bar"]

tagged = someFunction(text, words)

What should the code of someFunction be so that the value of the string tagged is "<tag>foo</tag> and <tag>bar</tag> are different from <tag>foo bar</tag>."?
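To illustrate why this is harder than it looks, here is a minimal sketch of a naive approach that is not part of the original question: calling str.replace once per list entry. The variable name naive is only for illustration.

```python
text = "foo and bar are different from foo bar."
words = ["foo", "bar", "foo bar"]

# Replace each word in turn, in list order.
naive = text
for word in words:
    naive = naive.replace(word, "<tag>" + word + "</tag>")

# By the time "foo bar" is processed, its two halves are already wrapped
# in separate tags, so the full phrase no longer occurs in the string.
print(naive)
# <tag>foo</tag> and <tag>bar</tag> are different from <tag>foo</tag> <tag>bar</tag>.
```

Processing the longest entry first does not rescue this approach either, because the later replacements of "foo" and "bar" would then match inside the already tagged phrase and produce nested tags.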

If I understood your problem correctly, this is what you are looking for:

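text = "foo and bar are different from foo bar."
words = ["foo", "bar", "foo bar"]

def add_tag(var):
    return "<tag>" + var + "</tag>"

def someFunction(text, words):
    # Tag every whitespace-separated token that appears in words.
    # Limitation: multi-word entries such as "foo bar" and tokens with
    # attached punctuation such as "bar." are never matched.
    tokens = []
    for var in text.split():
        tokens.append(add_tag(var) if var in words else var)
    return " ".join(tokens)

print(someFunction(text, words))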

Here, add_tag() does the tagging that you are looking for in someFunction.

A simple way to achieve that is to sort words by length in descending order and then build a regular expression of the form word1|word2|... . Since the re engine tries the alternatives from left to right and takes the first one that matches at a given position, longer strings are caught first.
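The effect of the ordering can be checked in isolation. The following is a small sketch (the variable names short_first and long_first are only for illustration) that compares the two possible orders of the alternatives on the same input:

```python
import re

text = "foo bar"

# Shorter alternative listed first: "foo" matches at position 0,
# so the engine never gets to try "foo bar" there.
short_first = re.findall("foo|foo bar", text)

# Longer alternative listed first: the whole phrase is matched.
long_first = re.findall("foo bar|foo", text)

print(short_first)  # ['foo']
print(long_first)   # ['foo bar']
```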

import re

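def tag_it(text, words):
    # Longest words first, so that "foo bar" is tried before "foo".
    pattern = '|'.join(sorted(words, key=len, reverse=True))
    # Wrap every match in <tag>...</tag>.
    return re.sub(pattern, lambda m: '<tag>' + m.group(0) + '</tag>', text)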


text = "foo and bar are different from foo bar."
words = ["foo", "bar", "foo bar"]


print(tag_it(text, words))
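If the word list may contain regex metacharacters, or if partial matches inside longer words (for example "foo" inside "food") are unwanted, the same idea can be hardened a little. The following is only a sketch under those assumptions; the function name tag_words and the tag parameter are hypothetical and not part of the answer above.

```python
import re

def tag_words(text, words, tag="tag"):
    # Longest entries first so "foo bar" is tried before "foo".
    ordered = sorted(words, key=len, reverse=True)
    # re.escape keeps characters such as "." or "+" literal.
    # \b limits matches to whole words; this is an added assumption and
    # only behaves as expected for entries that start and end with
    # word characters.
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, ordered)) + r")\b")
    return pattern.sub(lambda m: "<{0}>{1}</{0}>".format(tag, m.group(0)), text)

text = "foo and bar are different from foo bar."
words = ["foo", "bar", "foo bar"]

print(tag_words(text, words))
# <tag>foo</tag> and <tag>bar</tag> are different from <tag>foo bar</tag>.

print(tag_words("food and foo", ["foo"]))
# food and <tag>foo</tag>
```

Compiling the pattern once also helps if the same word list is applied to many texts.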

Note: technical posts on this site are licensed under CC BY-SA 4.0. If you republish this content, please cite this site's URL or the original address.
