I have a text with words separated by ., with instances of 2 and 3 consecutive repeated words:
My.name.name.is.Inigo.Montoya.You.killed.my.father.father.father.Prepare.to.die-
I need to match the duplicates and the triplicates independently with a regex, so that the pattern for duplicates does not also match inside a triplicate.
Since there are at most 3 consecutive repeated words, this
r'\b(\w+)\.+\1\.+\1\b'
successfully catches
father.father.father
However, in order to catch 2 consecutive repeated words, I need to make sure the next and previous words aren't the same. I can do a negative look-ahead
r'\b(\w+)\.+\1(?!\.+\1)\b'
but my attempts at the negative look-behind
r'(?<!(\w)\.)\b\1\.+\1\b(?!\.\1)'
either run into a fixed-width issue (when I keep the +) or fail in some other way.
How should I correct the negative look-behind?
I think that there might be an easier way to capture what you want without the negative look-behind:
r = re.compile(r'\b((\w+)\.+\2\.+\2?)\b')
r.findall(t)
> [('name.name.', 'name'), ('father.father.father', 'father')]
Just making the third repetition optional.
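As a hedged variation on the same idea (not the pattern above verbatim), the optional third repetition can be given its own capturing group, so that its presence tells duplicates from triplicates. In this sketch t is assumed to hold the sample text from the question, and the names pattern, duplicates and triplicates are only illustrative. A \b after each back-reference keeps a match from ending inside a longer word, and the trailing dot is no longer captured.

```python
import re

# Assumption: t holds the sample text from the question.
t = "My.name.name.is.Inigo.Montoya.You.killed.my.father.father.father.Prepare.to.die"

# Group 1: the word. Group 2: the optional third repetition (None if absent).
pattern = re.compile(r'\b(\w+)\.+\1\b(\.+\1\b)?')

duplicates, triplicates = [], []
for m in pattern.finditer(t):
    # A match with group 2 set has three repetitions, otherwise two.
    (triplicates if m.group(2) else duplicates).append(m.group(0))

print(duplicates)   # ['name.name']
print(triplicates)  # ['father.father.father']
```

This relies on the question's guarantee that a word repeats at most 3 times in a row; a fourth repetition would be left over for the next match attempt.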
A version that captures any number of repetitions of the same word can look something like this:
r = re.compile(r'\b((\w+)(\.+\2)\3*)\b')
r.findall(t)
> [('name.name', 'name', '.name'), ('father.father.father', 'father', '.father')]
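Building on the any-number-of-repetitions idea, one possible way to sort the runs by length is to count the words in each match. This is only a sketch: the pattern is a slightly simplified variant of the one above, and classify_runs is a hypothetical helper name, not something from the re module.

```python
import re

# Assumption: t holds the sample text from the question.
t = "My.name.name.is.Inigo.Montoya.You.killed.my.father.father.father.Prepare.to.die"

# A word followed by one or more dot-separated repeats of itself.
run = re.compile(r'\b(\w+)(?:\.+\1\b)+')

def classify_runs(text):
    """Map run length (2, 3, ...) to the list of matched runs of that length."""
    by_count = {}
    for m in run.finditer(text):
        # Number of words in the run = number of pieces between dot groups.
        count = len(re.split(r'\.+', m.group(0)))
        by_count.setdefault(count, []).append(m.group(0))
    return by_count

runs = classify_runs(t)
print(runs)  # {2: ['name.name'], 3: ['father.father.father']}
```

With this, runs.get(2, []) gives the duplicates and runs.get(3, []) the triplicates, without any look-behind.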
Maybe regexes are not needed at all.
Using itertools.groupby does the job. It's designed to group equal occurrences of consecutive items.
Split the text on dots, group the words, convert each group to a list, and emit a (value, count) tuple only if the length is > 1, like this:
import itertools
s = "My.name.name.is.Inigo.Montoya.You.killed.my.father.father.father.Prepare.to.die"
matches = [(l[0],len(l)) for l in (list(v) for k,v in itertools.groupby(s.split("."))) if len(l)>1]
result:
[('name', 2), ('father', 3)]
So basically we can do whatever we want with this list of tuples (filtering it on the number of occurrences, for instance).
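For instance, filtering on the number of occurrences could look like the sketch below. The names duplicates and triplicates are illustrative, and the code assumes words are separated by single dots, since s.split(".") would yield empty strings for consecutive dots.

```python
import itertools

s = "My.name.name.is.Inigo.Montoya.You.killed.my.father.father.father.Prepare.to.die"

# (word, run length) for every run of consecutive equal words longer than 1.
matches = [
    (word, n)
    for word, n in (
        (k, sum(1 for _ in group)) for k, group in itertools.groupby(s.split("."))
    )
    if n > 1
]

# Separate the runs by their length.
duplicates = [word for word, n in matches if n == 2]
triplicates = [word for word, n in matches if n == 3]

print(duplicates)   # ['name']
print(triplicates)  # ['father']
```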
Bonus (I misread the question at first, so I'm leaving it in): to remove the duplicates from the sentence:
- group by words (after splitting on dots), like above
- take only the key of each group in a list comprehension (we don't need the grouped values since we aren't counting)
- join back with dots
In one line (still using itertools):
new_s = ".".join([k for k,_ in itertools.groupby(s.split("."))])
result:
My.name.is.Inigo.Montoya.You.killed.my.father.Prepare.to.die
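If a regex is preferred for this clean-up step, a rough equivalent can be sketched with re.sub, replacing each run of a repeated word by a single copy. This is an alternative to the groupby one-liner, not part of it, and the variable name new_s_regex is illustrative.

```python
import re

s = "My.name.name.is.Inigo.Montoya.You.killed.my.father.father.father.Prepare.to.die"

# Replace a word followed by dot-separated repeats of itself with one copy.
new_s_regex = re.sub(r'\b(\w+)(?:\.+\1\b)+', r'\1', s)

print(new_s_regex)
# My.name.is.Inigo.Montoya.You.killed.my.father.Prepare.to.die
```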