
Regular expression finditer: search twice on the same symbols

I need to find matches in a text and get their positions. For example, I have to find "hello hello" in the text. When the text is "hello hello world hello hello", everything is fine: I get the positions 0-11 and 18-29. But when the text is "hello hello hello world", I get only one position, 0-11, although I need both overlapping matches (0-11 and 6-17). That is, I get

  1. hello hello hello world (match at 0-11)

but I need to get

  1. hello hello hello world (match at 0-11)

  2. hello hello hello world (match at 6-17)

In another case I have to find a more complex pattern: "hello 1,2 beautiful 2,4 world". It means that between the words "hello" and "beautiful" there can be one or two words, and between the words "beautiful" and "world" there can be 2, 3 or 4 words. I have to find all the combinations.

This is the pattern: re.compile(u'(^|[\\[\\]\\/\\\\\\^\\$\\.\\|\\?\\*\\+\\(\\)\\{\\} !<>:;,#@])(hello)(([\\[\\]\\/\\\\^\\$\\.\\|\\?\\*\\+\\(\\)\\{\\} !<>:;,#@%]+[a-zA-Zа-яА-Я$]+(-[a-zA-Zа-яА-Я$]+)*){1,2}[\\[\\]\\/\\\\^\\$\\.\\|\\?\\*\\+\\(\\)\\{\\} !<>:;,#@%]*)(beautiful)(([\\[\\]\\/\\\\^\\$\\.\\|\\?\\*\\+\\(\\)\\{\\} !<>:;,#@%]+[a-zA-Zа-яА-Я$]+(-[a-zA-Zа-яА-Я$]+)*){2,4}[\\[\\]\\/\\\\^\\$\\.\\|\\?\\*\\+\\(\\)\\{\\} !<>:;,#@%]*)(world)($|[\\[\\]\\/\\\\\\^\\$\\.\\|\\?\\*\\+\\(\\)\\{\\} !<>:;,#@])')

And the text is "hello very beautiful beautiful very big world world". I get only one combination, but I need to get 4:

  1. hello very beautiful beautiful very big world world (first "beautiful", first "world")

  2. hello very beautiful beautiful very big world world (second "beautiful", first "world")

  3. hello very beautiful beautiful very big world world (first "beautiful", second "world")

  4. hello very beautiful beautiful very big world world (second "beautiful", second "world")

How can I get all the combinations of matches when the matches overlap each other?

The flag re.DOTALL doesn't help.

import re

patterns = [
    u'(hello)(( [a-z]+ *){1,2})(beautiful)(( [a-z]+ *){2,4})(world)',
    u'hello hello'
]
text = u'hello hello hello world hello very beautiful beautiful very big world world'
for p in patterns:
    print(p)
    c = re.compile(p, flags=re.I | re.U)
    # finditer yields only non-overlapping matches
    for m in c.finditer(text):
        print(m.start(), m.end())

The result is:

>>> (hello)(( [a-z]+ *){1,2})(beautiful)(( [a-z]+ *){2,4})(world)
>>> 24 69
(need 24 69 and 24 69 and 24 75 and 24 75 - because there are two positions of the word "beautiful")
>>> hello hello
>>> 0 11
(need 0 11 and 6 17)

Real examples of the patterns are:

u"выйдите на улицы", u"избавить.* от", u"смотрите смотрите", u"смеят.*"

And with distances between the words:

имени 0,3 ленина

целых 0,5 лет.*

целых 0,5 лет.* 0,1 назад

UPD

The variant u'(?=(hello hello))' helps with patterns that have no distances between the words. But how can I use it in a pattern with distances, for example (hello) (?:[a-zA-Zа-яА-Я]+ ){1,2}(beautiful) (?:[a-zA-Zа-яА-Я]+ ){2,4}(world) ?
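For reference, the lookahead variant from the update can be sketched as follows. The helper name `overlapping_spans` is hypothetical, and the sketch assumes Python 3 and a pattern without numbered backreferences (wrapping the pattern in an extra group shifts the group numbers).

```python
import re

def overlapping_spans(pattern, text, flags=0):
    """Return (start, end) for every match of pattern, including overlapping ones."""
    # Wrapping the pattern in a lookahead makes each match zero-width, so the
    # scanner advances one character at a time instead of jumping past the match.
    lookahead = re.compile('(?=(' + pattern + '))', flags)
    return [(m.start(1), m.end(1)) for m in lookahead.finditer(text)]

print(overlapping_spans('hello hello', 'hello hello hello world'))
# [(0, 11), (6, 17)]
```

Note that this reports at most one match per start position, which is why it does not by itself enumerate every combination for patterns with variable distances.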

I think you can try the expression below instead of a regexp. It does not look that good, but it might solve your problem:

Expression:

 [pos for pos, char in enumerate(string) if string[pos:].find(pattern) == 0]

It returns a list of the positions of the pattern in the string.

In [43]: string = "hello very beautiful beautiful very big world world"
In [44]: pattern='hello'
In [45]: [pos for pos, char in enumerate(string) if string[pos:].find(pattern) == 0]
Out[45]: [0]
In [46]: pattern='very'
In [47]: [pos for pos, char in enumerate(string) if string[pos:].find(pattern) == 0]
Out[47]: [6, 31]
In [48]: pattern='world'
In [49]: [pos for pos, char in enumerate(string) if string[pos:].find(pattern) == 0]
Out[49]: [40, 46]
In [50]: pattern='very big'
In [51]: [pos for pos, char in enumerate(string) if string[pos:].find(pattern) == 0]
Out[51]: [31]
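The same idea can be written as a small function. This is only a sketch: the name `find_all` is hypothetical, it uses `str.find` with a start offset instead of slicing the string at every position, and it handles literal substrings only (not patterns such as `.*` or word distances).

```python
def find_all(text, pattern):
    """Return the start index of every occurrence of pattern, overlaps included."""
    positions = []
    pos = text.find(pattern)
    while pos != -1:
        positions.append(pos)
        # Resume one character later (not after the match) so overlaps are kept.
        pos = text.find(pattern, pos + 1)
    return positions

text = "hello hello hello world"
print([(p, p + len("hello hello")) for p in find_all(text, "hello hello")])
# [(0, 11), (6, 17)]
```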

Hope this helps.

Your question is still a bit unclear about what you wish to do, but I'll take a stab at it:

Regex to find repeated words without consuming the repetition:

([a-zA-Zа-яА-Я]+)(?= (\1))
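A minimal sketch of how this backreference regex could be used with `finditer`, assuming Python 3. The `\b` word boundaries are an addition not present in the regex above; without them a partial word can match (for example the "h" in "oh hello").

```python
import re

# Same idea as the regex above, with \b added (an assumption) so only whole words repeat.
repeat = re.compile(r'\b([a-zA-Zа-яА-Я]+)\b(?= (\1)\b)')

text = 'hello hello hello world'
# m.start() is where the first word begins; m.end(2) is where its repetition ends.
spans = [(m.start(), m.end(2)) for m in repeat.finditer(text)]
print(spans)
# [(0, 11), (6, 17)]
```

Because the repeated word sits inside the lookahead, it is not consumed and can start the next match.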

Regex to find "hello", "beautiful" and "world" with a specific number of words in between:

(hello) (?:[a-zA-Zа-яА-Я]+ ){1,2}(beautiful) (?:[a-zA-Zа-яА-Я]+ ){2,4}(world)
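A quick check of what this regex returns on the sample text from the question (a sketch, assuming Python 3 and single spaces between words). The quantifiers are greedy, so a single `finditer` pass reports only the longest combination.

```python
import re

text = 'hello very beautiful beautiful very big world world'
pattern = re.compile(
    r'(hello) (?:[a-zA-Zа-яА-Я]+ ){1,2}(beautiful) (?:[a-zA-Zа-яА-Я]+ ){2,4}(world)'
)

# Only one match is found: it uses the second "beautiful" and the second "world".
spans = [m.span() for m in pattern.finditer(text)]
print(spans)
# [(0, 51)]
```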

Final Update

What you wish to do cannot easily be done entirely with a regex in a single pass.

It would be easier to loop and build a separate regex for each combination of distances:

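for i in range(1, 3):        # 1 or 2 words between "hello" and "beautiful"
    for j in range(2, 5):    # 2, 3 or 4 words between "beautiful" and "world"
        # raw strings keep \w from being treated as a string escape
        regStr = r'(hello) (?:\w+ ){' + str(i) + r'}(beautiful) (?:\w+ ){' + str(j) + r'}(world)'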

and then do a second check for duplicates using

([a-zA-Zа-яА-Я]+)(?= (\1))
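Putting the loop together with a lookahead, a sketch of the whole approach might look like this. It assumes Python 3 and single spaces between words; the function name `all_combinations`, the added `\b` boundaries and the use of a set for de-duplication (instead of the second regex check) are choices made for this illustration.

```python
import re

def all_combinations(text):
    """Collect (hello start, beautiful start, world end) for every allowed gap pair."""
    found = set()
    for i in range(1, 3):        # exactly i words between "hello" and "beautiful"
        for j in range(2, 5):    # exactly j words between "beautiful" and "world"
            body = r'\b(hello) (?:\w+ ){%d}(beautiful) (?:\w+ ){%d}(world)\b' % (i, j)
            # The lookahead keeps matches zero-width, so overlapping starts are found.
            for m in re.finditer('(?=' + body + ')', text, re.I):
                found.add((m.start(1), m.start(2), m.end(3)))
    return sorted(found)

text = 'hello very beautiful beautiful very big world world'
for combo in all_combinations(text):
    print(combo)
```

On the sample text this yields the four combinations asked for in the question: each of the two "beautiful" positions paired with each of the two "world" positions.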
