How to replace a list of strings in a text where some of them are substrings of others in Python?
I have a text containing some words that I would like to tag, and the words to be tagged are contained in a list. The problem is that some of those words are substrings of others, but I want to tag the longest recognized string from the list.
For example, if my text is "foo and bar are different from foo bar." and my list contains "foo", "bar" and "foo bar", the result should be "<tag>foo</tag> and <tag>bar</tag> are different from <tag>foo bar</tag>."
text = "foo and bar are different from foo bar."
words = ["foo", "bar", "foo bar"]
tagged = someFunction(text, words)
What should the code of someFunction be so that the value of the string tagged is "<tag>foo</tag> and <tag>bar</tag> are different from <tag>foo bar</tag>."?
If I understood your problem correctly, then this is what you are looking for:
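text = "foo and bar are different from foo bar."
words = ["foo", "bar", "foo bar"]

def add_tag(var):
    # Wrap a single word in the tag markers.
    return "<tag>" + var + "</tag>"

def someFunction(text, words):
    tagged = []  # pieces of the final string
    # Note: splitting on whitespace yields single tokens only, so a
    # multi-word entry such as "foo bar" is never matched as a whole.
    for var in text.split():
        tagged.append(add_tag(var) if var in words else var)
    return " ".join(tagged)

print(someFunction(text, words))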
Here add_tag() does the tagging you are looking for in someFunction.
A simple way to achieve that would be to sort words by length in reverse order and then create a regular expression word1|word2|... Since the re engine tries the alternatives from left to right and takes the first one that matches, longer strings will be caught first.
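The effect of the ordering can be checked with a small sketch. The pattern strings and variable names below are made up for illustration; only the standard re module is assumed.

```python
import re

text = "foo bar"

# Shorter alternative listed first: the engine settles for "foo".
short_first = re.findall(r"foo|foo bar", text)

# Longer alternative listed first: "foo bar" is matched as a whole.
long_first = re.findall(r"foo bar|foo", text)

print(short_first)
print(long_first)
```

With the shorter alternative first, the longer entry is never reached, which is why the list is sorted by length before joining.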
import re
def tag_it(text, words):
    return re.sub(
        '|'.join(sorted(words, key=len, reverse=True)),
        lambda m: '<tag>' + m.group(0) + '</tag>',
        text)
text = "foo and bar are different from foo bar."
words = ["foo", "bar", "foo bar"]
print(tag_it(text, words))
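If the word list may contain regex metacharacters, or if entries should not match inside longer words (for example "foo" inside "foobar"), a slightly more defensive variant could look like the sketch below. The name tag_words and its tag parameter are hypothetical, not part of the answer above, and the \b boundaries assume every entry starts and ends with a word character.

```python
import re

def tag_words(text, words, tag="tag"):
    # Hypothetical helper: nothing to tag if the list is empty.
    if not words:
        return text
    # Longest entries first so that "foo bar" wins over "foo".
    ordered = sorted(words, key=len, reverse=True)
    # re.escape keeps characters such as "." or "+" literal.
    alternation = "|".join(re.escape(w) for w in ordered)
    # \b restricts matches to whole words (assumes entries begin
    # and end with a word character).
    pattern = re.compile(r"\b(?:%s)\b" % alternation)
    return pattern.sub(
        lambda m: "<%s>%s</%s>" % (tag, m.group(0), tag), text)

text = "foo and bar are different from foo bar, unlike foobar."
words = ["foo", "bar", "foo bar"]
print(tag_words(text, words))
```

Compiling the pattern once inside the function keeps the call site the same as tag_it; whether whole-word matching is wanted depends on the data, so the \b part can be dropped if substrings should be tagged as well.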