Python look-behind regex “fixed-width pattern” error while looking for consecutive repeated words
I have a text with words separated by dots (.), with instances of 2 and 3 consecutive repeated words:
My.name.name.is.Inigo.Montoya.You.killed.my.father.father.father.Prepare.to.die-
I need to match them independently with regex, keeping the duplicates separate from the triplicates.
Since there are at most 3 consecutive repeated words, this
r'\b(\w+)\.+\1\.+\1\b'
successfully catches
father.father.father
However, in order to catch 2 consecutive repeated words, I need to make sure the next and previous words aren't the same. I can do a negative look-ahead
r'\b(\w+)\.+\1(?!\.+\1)\b'
but my attempts at the negative look-behind
r'(?<!(\w)\.)\b\1\.+\1\b(?!\.\1)'
either return a fixed-width issue (when I keep the +) or some other issue.
How should I correct the negative look-behind?
I think that there might be an easier way to capture what you want without the negative look-behind:
import re
t = "My.name.name.is.Inigo.Montoya.You.killed.my.father.father.father.Prepare.to.die"
r = re.compile(r'\b((\w+)\.+\2\.+\2?)\b')
r.findall(t)
> [('name.name.', 'name'), ('father.father.father', 'father')]
Just making the third repetition optional. Note that a duplicate is matched together with its trailing dot.
A version that captures any number of repetitions of the same word can look something like this:
r = re.compile(r'\b((\w+)(\.+\2)\3*)\b')
r.findall(t)
> [('name.name', 'name', '.name'), ('father.father.father', 'father', '.father')]
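If the goal is to treat duplicates and triplicates separately, one option is to skip the look-behind, match every run of a repeated word, and then filter by run length. This is only a sketch, assuming single words separated by one or more dots; the names run, runs_of, doubles and triples are illustrative and not from the original posts:

```python
import re

text = "My.name.name.is.Inigo.Montoya.You.killed.my.father.father.father.Prepare.to.die"

# A whole word followed by one or more dot-separated repeats of itself.
# The trailing \b keeps "a.ab" from counting as a repeat of "a".
run = re.compile(r'\b(\w+)(?:\.+\1\b)+')

def runs_of(text, n):
    """Return the matched runs in which a word occurs exactly n times in a row."""
    found = []
    for m in run.finditer(text):
        # Count the words in the run by splitting on runs of dots.
        count = len(re.split(r'\.+', m.group(0)))
        if count == n:
            found.append(m.group(0))
    return found

doubles = runs_of(text, 2)
triples = runs_of(text, 3)
```

Because each run is matched as a whole, a triplicate can never be reported as a duplicate, which is what the negative look-behind was meant to guarantee.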
Maybe regexes are not needed at all.
Using itertools.groupby does the job. It's designed to group equal occurrences of consecutive items.
- group by words (after splitting on dots)
- convert each group to a list and emit a (value, count) tuple only if its length is > 1
like this:
import itertools
s = "My.name.name.is.Inigo.Montoya.You.killed.my.father.father.father.Prepare.to.die"
matches = [(l[0],len(l)) for l in (list(v) for k,v in itertools.groupby(s.split("."))) if len(l)>1]
result:
[('name', 2), ('father', 3)]
So basically we can do whatever we want with this list of tuples (filtering it on the number of occurrences, for instance).
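As a sketch of that filtering step, the runs can be split by their length to get the duplicates and the triplicates separately. The names runs, doubles and triples are illustrative, and the code assumes words are separated by exactly one dot (consecutive dots would produce empty strings from split):

```python
import itertools

s = "My.name.name.is.Inigo.Montoya.You.killed.my.father.father.father.Prepare.to.die"

# (word, run length) for every run of consecutive equal words.
runs = [(word, len(list(group))) for word, group in itertools.groupby(s.split("."))]

# Keep only the words whose run has the requested length.
doubles = [word for word, count in runs if count == 2]
triples = [word for word, count in runs if count == 3]
```

Here doubles holds the words repeated exactly twice and triples the words repeated exactly three times.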
Bonus (as I misread the question at first, so I'm leaving it in): to remove the duplicates from the sentence
- group by words (after splitting on dots) like above
- take only the key of each group in a list comprehension (we don't need the values since we don't count)
- join back with a dot
In one line (still using itertools):
new_s = ".".join([k for k,_ in itertools.groupby(s.split("."))])
result:
My.name.is.Inigo.Montoya.You.killed.my.father.Prepare.to.die