Python look-behind regex "fixed-width pattern" error while looking for consecutive repeated words

I have a text with words separated by dots (.), with instances of 2 and 3 consecutive repeated words:

My.name.name.is.Inigo.Montoya.You.killed.my.father.father.father.Prepare.to.die-

I need to match them independently with regex, so that the duplicates are not also picked up as part of the triplicates.

Since there are at most 3 consecutive repeated words, this

r'\b(\w+)\.+\1\.+\1\b'

successfully catches

father.father.father

However, in order to catch 2 consecutive repeated words, I need to make sure the next and previous words aren't the same. I can do a negative look-ahead

r'\b(\w+)\.+\1(?!\.+\1)\b'

but my attempts at the negative look-behind

r'(?<!(\w)\.)\b\1\.+\1\b(?!\.\1)'

either raise a fixed-width error (when I keep the +) or fail in some other way.

How should I correct the negative look-behind?
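
For reference, the error itself can be reproduced in isolation. The sketch below is only an illustration (the pattern and the names error_message and fixed are made up for this example, not taken from the question): Python's re module rejects a look-behind whose body can match strings of different lengths, and accepts one whose length is fixed.

```python
import re

# A look-behind containing + can match a variable number of characters,
# which the standard re module refuses to compile.
try:
    re.compile(r'(?<!\w+\.)father')
    error_message = None
except re.error as exc:
    error_message = str(exc)

print(error_message)

# With a body of exactly two characters (one word character and a dot),
# the look-behind has a fixed width and compiles without complaint.
fixed = re.compile(r'(?<!\w\.)father')
```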

I think that there might be an easier way to capture what you want without the negative look-behind:

import re

t = "My.name.name.is.Inigo.Montoya.You.killed.my.father.father.father.Prepare.to.die"
r = re.compile(r'\b((\w+)\.+\2\.+\2?)\b')
r.findall(t)

> [('name.name.', 'name'), ('father.father.father', 'father')]

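This just makes the third repetition optional. Note that the two-word match keeps its trailing dot ('name.name.'), because the second \.+ is still required.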
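
To get the duplicates and triplicates as separate results, one option is to count the words inside each match. This is a minimal sketch built on the pattern above; the names text, doubles and triples are illustrative, and it assumes (as the question states) that a word repeats at most 3 times.

```python
import re

text = "My.name.name.is.Inigo.Montoya.You.killed.my.father.father.father.Prepare.to.die"
pattern = re.compile(r'\b((\w+)\.+\2\.+\2?)\b')

doubles, triples = [], []
for whole, word in pattern.findall(text):
    # Count the words in the full match; this ignores any trailing dot.
    count = len(re.findall(r'\w+', whole))
    if count == 3:
        triples.append(word)
    else:
        doubles.append(word)

print(doubles)  # ['name']
print(triples)  # ['father']
```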


A version that captures any number of repetitions of the same word can look something like this:

r = re.compile(r'\b((\w+)(\.+\2)\3*)\b')
r.findall(t)
> [('name.name', 'name', '.name'), ('father.father.father', 'father', '.father')]
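
Because \3* repeats the exact text of the first separator-plus-word, the pattern above stops early if the number of dots varies between repetitions. A possible variant, offered only as a sketch and not as the answer's own pattern, repeats a non-capturing group instead and reports each word with its run length (text, pattern and runs are illustrative names):

```python
import re

text = "My.name.name.is.Inigo.Montoya.You.killed.my.father.father.father.Prepare.to.die"

# (\w+) captures a word; (?:\.+\1\b)+ then requires one or more further
# copies of it, each preceded by one or more dots.
pattern = re.compile(r'\b(\w+)(?:\.+\1\b)+')

runs = [
    (m.group(1), len(re.findall(r'\w+', m.group(0))))
    for m in pattern.finditer(text)
]
print(runs)  # [('name', 2), ('father', 3)]
```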

Maybe regexes are not needed at all.

Using itertools.groupby does the job. It's designed to group consecutive equal items.

  • group by words (after splitting on dots)
  • convert each group to a list and emit a (value, count) tuple only if its length is > 1

like this:

import itertools

s = "My.name.name.is.Inigo.Montoya.You.killed.my.father.father.father.Prepare.to.die"

matches = [(l[0],len(l)) for l in (list(v) for k,v in itertools.groupby(s.split("."))) if len(l)>1]

result:

[('name', 2), ('father', 3)]

So basically we can do whatever we want with this list of tuples (filtering it on the number of occurrences, for instance).
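
As one example of such filtering, the sketch below splits the result into words repeated exactly twice and words repeated exactly three times. The names runs, doubles and triples are illustrative, and the input is assumed to use single dots as separators.

```python
import itertools

s = "My.name.name.is.Inigo.Montoya.You.killed.my.father.father.father.Prepare.to.die"

# One (word, run_length) pair per run of consecutive equal words.
runs = [(word, len(list(group))) for word, group in itertools.groupby(s.split("."))]

doubles = [word for word, n in runs if n == 2]
triples = [word for word, n in runs if n == 3]

print(doubles)  # ['name']
print(triples)  # ['father']
```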

Bonus (I misread the question at first, so I'm leaving this in): to remove the duplicates from the sentence:

  • group by words (after splitting on dots), as above
  • keep only the key of each group in a list comprehension (the grouped values aren't needed, since we don't count)
  • join back with a dot

In one line (still using itertools):

new_s = ".".join([k for k,_ in itertools.groupby(s.split("."))])

result:

My.name.is.Inigo.Montoya.You.killed.my.father.Prepare.to.die
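
For comparison, the same clean-up can be sketched with re.sub. The pattern here is an illustrative assumption rather than something from the answers: it matches a word followed by one or more dot-separated copies of itself and replaces the whole run with a single copy.

```python
import re

s = "My.name.name.is.Inigo.Montoya.You.killed.my.father.father.father.Prepare.to.die"

# Replace each run of a repeated word (word, dots, same word, ...) with
# one occurrence of that word.
deduped = re.sub(r'\b(\w+)(?:\.+\1\b)+', r'\1', s)

print(deduped)
# My.name.is.Inigo.Montoya.You.killed.my.father.Prepare.to.die
```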
