
Regular expression finditer: search twice on the same symbols

I need to find matches in a text and get their positions. For example, I have to find "hello hello" in the text. When the text is "hello hello world hello hello", everything is fine: I get the positions 0-11 and 18-29. But when the text is "hello hello hello world", I get only one position, 0-11, whereas I need both (0-11 and 6-17). That is, I get

  1. [hello hello] hello world

but have to get (the match is shown in brackets)

  1. [hello hello] hello world

  2. hello [hello hello] world

In another case I have to find a complex pattern: "hello 1,2 beautiful 2,4 world". It means that there can be one or two words between "hello" and "beautiful", and 2, 3 or 4 words between "beautiful" and "world". I have to find all the combinations.

This is the pattern: re.compile(u'(^|[\\[\\]\\/\\\\\\^\\$\\.\\|\\?\\*\\+\\(\\)\\{\\} !<>:;,#@])(hello)(([\\[\\]\\/\\\\^\\$\\.\\|\\?\\*\\+\\(\\)\\{\\} !<>:;,#@%]+[a-zA-Zа-яА-Я$]+(-[a-zA-Zа-яА-Я$]+)*){1,2}[\\[\\]\\/\\\\^\\$\\.\\|\\?\\*\\+\\(\\)\\{\\} !<>:;,#@%]*)(beautiful)(([\\[\\]\\/\\\\^\\$\\.\\|\\?\\*\\+\\(\\)\\{\\} !<>:;,#@%]+[a-zA-Zа-яА-Я$]+(-[a-zA-Zа-яА-Я$]+)*){2,4}[\\[\\]\\/\\\\^\\$\\.\\|\\?\\*\\+\\(\\)\\{\\} !<>:;,#@%]*)(world)($|[\\[\\]\\/\\\\\\^\\$\\.\\|\\?\\*\\+\\(\\)\\{\\} !<>:;,#@])')

And the text is "hello very beautiful beautiful very big world world". I get only one combination, but I need all 4:

  1. [hello] very [beautiful] beautiful very big [world] world

  2. [hello] very beautiful [beautiful] very big [world] world

  3. [hello] very [beautiful] beautiful very big world [world]

  4. [hello] very beautiful [beautiful] very big world [world]

How can I get all combinations of the matches when the matches overlap each other?

The flag re.DOTALL doesn't help.

import re

patterns = [
    u'(hello)(( [a-z]+ *){1,2})(beautiful)(( [a-z]+ *){2,4})(world)',
    u'hello hello'
]
text = u'hello hello hello world hello very beautiful beautiful very big world world'
for p in patterns:
    print(p)
    c = re.compile(p, flags=re.I | re.U)
    # finditer() returns only non-overlapping matches
    for m in c.finditer(text):
        print(m.start(), m.end())

Result is

>>> (hello)(( [a-z]+ *){1,2})(beautiful)(( [a-z]+ *){2,4})(world)
>>> 24 69
(need 24 69 and 24 69 and 24 75 and 24 75 - because there are two positions of the word "beautiful")
>>> hello hello
>>> 0 11
(need 0 11 and 6 17)

Real examples of the patterns are:

u"выйдите на улицы", u"избавить.* от", u"смотрите смотрите", u"смеят.*"

And with distances:

имени 0,3 ленина

целых 0,5 лет.*

целых 0,5 лет.* 0,1 назад

UPD

The variant u'(?=(hello hello))' helps with patterns that have no distances between the words. But how can I use it in a pattern with distances, for example (hello) (?:[a-zA-Zа-яА-Я]+ ){1,2}(beautiful) (?:[a-zA-Zа-яА-Я]+ ){2,4}(world) ?
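For reference, here is a minimal sketch of the lookahead variant for a pattern without distances. The helper name `overlapping_spans` is made up for illustration; it assumes Python 3 and that the pattern is a plain regex fragment.

```python
import re

def overlapping_spans(pattern, text, flags=re.I | re.U):
    """Return (start, end) for every match of `pattern`, overlapping ones included."""
    # Wrapping the pattern in a lookahead makes each match zero-width, so
    # finditer() retries at every position; group 1 holds the real match.
    lookahead = re.compile('(?=(%s))' % pattern, flags)
    return [(m.start(1), m.end(1)) for m in lookahead.finditer(text)]

print(overlapping_spans('hello hello', 'hello hello hello world'))
# [(0, 11), (6, 17)]
```

The lookahead yields at most one match per start position, so on its own it does not enumerate several combinations that begin at the same "hello".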

I think you can try the expression below instead of a regexp. It doesn't look great, but it might solve your problem:

Expression:

 [pos for pos, char in enumerate(string) if string[pos:].find(pattern) == 0]

It returns a list with the positions of the pattern in the string.

In [43]: string = "hello very beautiful beautiful very big world world"
In [44]: pattern='hello'
In [45]: [pos for pos, char in enumerate(string) if string[pos:].find(pattern) == 0]
Out[45]: [0]
In [46]: pattern='very'
In [47]: [pos for pos, char in enumerate(string) if string[pos:].find(pattern) == 0]
Out[47]: [6, 31]
In [48]: pattern='world'
In [49]: [pos for pos, char in enumerate(string) if string[pos:].find(pattern) == 0]
Out[49]: [40, 46]
In [50]: pattern='very big'
In [51]: [pos for pos, char in enumerate(string) if string[pos:].find(pattern) == 0]
Out[51]: [31]
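The list comprehension above slices the string at every position, which gets slow on long texts. A possible alternative, sketched under the assumption that the phrase is a literal string rather than a regex, is to call `str.find` repeatedly with a start offset. The function name `find_all` is hypothetical.

```python
def find_all(text, phrase):
    """Return every start index of `phrase` in `text`, overlapping ones included."""
    positions = []
    pos = text.find(phrase)
    while pos != -1:
        positions.append(pos)
        # Resume one character later, not after the match, so overlaps are kept.
        pos = text.find(phrase, pos + 1)
    return positions

text = "hello very beautiful beautiful very big world world"
print(find_all(text, "very"))                              # [6, 31]
print(find_all("hello hello hello world", "hello hello"))  # [0, 6]
```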

Hope this helps.

Your question is still lacking a bit in clarity of what you wish to do, but I'll take a stab at it:

Regex to find repetitions without consumption:

([a-zA-Zа-яА-Я]+)(?= (\1))
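A short sketch of how this regex might be used with finditer to recover the span of each repeated pair. The names `REPEATED_WORD` and `repeated_phrase_spans` are assumptions made for the example.

```python
import re

# The regex from above: a word followed, without consuming it,
# by a space and the same word again.
REPEATED_WORD = re.compile(r'([a-zA-Zа-яА-Я]+)(?= (\1))')

def repeated_phrase_spans(text):
    """Return (start, end) spans covering each 'word word' repetition."""
    # Group 1 is the first copy; group 2 is the copy seen by the lookahead.
    return [(m.start(1), m.end(2)) for m in REPEATED_WORD.finditer(text)]

print(repeated_phrase_spans('hello hello hello world'))
# [(0, 11), (6, 17)]
```

Note that the regex has no word-boundary anchors, so it can also fire on part of a longer word; adding `\b` on both sides would be one way to tighten it.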

Regex to find hello beautiful and world with specific numbers of words in between:

(hello) (?:[a-zA-Zа-яА-Я]+ ){1,2}(beautiful) (?:[a-zA-Zа-яА-Я]+ ){2,4}(world)

Final Update

What you wish to do is not easily done completely in regex in a single run.

Easier would be to loop and do different regexes:

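for i in range(1, 3):      # 1 or 2 words between "hello" and "beautiful"
    for j in range(2, 5):  # 2, 3 or 4 words between "beautiful" and "world"
        regStr = r'(hello) (?:\w+ ){' + str(i) + r'}(beautiful) (?:\w+ ){' + str(j) + r'}(world)'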

and then do a second check for duplicates using

([a-zA-Zа-яА-Я]+)(?= (\1))
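The loop-over-distances idea can be sketched end to end roughly as follows. This is only an illustration: the function `all_combinations` and its parameters are hypothetical, the keywords are inserted into the regex as-is, and words are assumed to be separated by single spaces. Each exact pair of distances is compiled into its own regex and wrapped in a lookahead, so every combination is reported separately.

```python
import re

def all_combinations(text, first, middle, last, gap1, gap2):
    """Return sorted (start, middle_start, last_start, end) tuples, one per combination."""
    found = set()
    for i in range(gap1[0], gap1[1] + 1):        # words between first and middle
        for j in range(gap2[0], gap2[1] + 1):    # words between middle and last
            pattern = re.compile(
                r'(?=\b(%s) (?:\w+ ){%d}(%s) (?:\w+ ){%d}(%s)\b)'
                % (first, i, middle, j, last),
                re.I | re.U,
            )
            # The lookahead keeps matches zero-width, so overlapping starts are found too.
            for m in pattern.finditer(text):
                found.add((m.start(1), m.start(2), m.start(3), m.end(3)))
    return sorted(found)

text = 'hello hello hello world hello very beautiful beautiful very big world world'
print(all_combinations(text, 'hello', 'beautiful', 'world', (1, 2), (2, 4)))
# [(24, 35, 64, 69), (24, 35, 70, 75), (24, 45, 64, 69), (24, 45, 70, 75)]
```

On the text from the question this gives two combinations spanning 24-69 and two spanning 24-75, differing in which "beautiful" is used.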


 