Regular expression finditer: search twice on the same symbols

Question

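I need to find matches in a text and get their positions. For example, I have to find "hello hello" in the text. When the text is "hello hello world hello hello", everything is fine: I get positions 0-11 and 18-29. But when the text is "hello hello hello world", I get only one position, 0-11, although I have to find both (0-11 and 6-17). In other words, I get (the match is marked with square brackets)

[hello hello] hello world

but I need to get

[hello hello] hello world
hello [hello hello] world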

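In another case I have to find a complex pattern: "hello 1,2 beautiful 2,4 world". This means that between the words "hello" and "beautiful" there may be one or two words, and between "beautiful" and "world" there may be 2, 3 or 4 words. I have to find all the combinations.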

This is the pattern: re.compile(u'(^|[\\[\\]\\/\\\\\\^\\$\\.\\|\\?\\*\\+\$\$\\{\\} !<>:;,#@])(hello)(([\\[\\]\\/\\\\^\\$\\.\\|\\?\\*\\+\$\$\\{\\} !<>:;,#@%]+[a-zA-Zа-яА-Я$]+(-[a-zA-Zа-яА-Я$]+)*){1,2}[\\[\\]\\/\\\\^\\$\\.\\|\\?\\*\\+\$\$\\{\\} !<>:;,#@%]*)(beautiful)(([\\[\\]\\/\\\\^\\$\\.\\|\\?\\*\\+\$\$\\{\\} !<>:;,#@%]+[a-zA-Zа-яА-Я$]+(-[a-zA-Zа-яА-Я$]+)*){2,4}[\\[\\]\\/\\\\^\\$\\.\\|\\?\\*\\+\$\$\\{\\} !<>:;,#@%]*)(world)($|[\\[\\]\\/\\\\\\^\\$\\.\\|\\?\\*\\+\$\$\\{\\} !<>:;,#@])')

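And the text is "hello very beautiful beautiful very big world world". I get only one combination, but I need to get 4 (the matched keywords are marked with square brackets):

[hello] very [beautiful] beautiful very big [world] world
[hello] very beautiful [beautiful] very big [world] world
[hello] very [beautiful] beautiful very big world [world]
[hello] very beautiful [beautiful] very big world [world]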

When matches overlap each other, how can I get all the combinations of matches?

The re.DOTALL flag does not help.

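import re

patterns = [
    # distance pattern: 1-2 words after "hello", 2-4 words after "beautiful"
    u'(hello)(( [a-z]+ *){1,2})(beautiful)(( [a-z]+ *){2,4})(world)',
    # plain phrase whose occurrences can overlap
    u'hello hello'
]
text = u'hello hello hello world hello very beautiful beautiful very big world world'
for p in patterns:
    print(p)
    c = re.compile(p, flags=re.I | re.U)
    # finditer only yields non-overlapping matches
    for m in c.finditer(text):
        print(m.start(), m.end())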

The result is

>>> (hello)(( [a-z]+ *){1,2})(beautiful)(( [a-z]+ *){2,4})(world)
>>> 24 69
(need 24 69 and 24 69 and 24 75 and 24 75 - because there are two positions of the word "beautiful")
>>> hello hello
>>> 0 11
(need 0 11 and 6 17)

Real examples of these patterns are:

u"выйдите на улицы", u"избавить.*от", u"смотрите смотрите", u"смеят.*"

and with distances:

имени 0,3 ленина

целых 0,5 лет.*

целых 0,5 лет.* 0,1 назад

UPD

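The variant u'(?=(hello hello))' helps when there is no distance between the words of the pattern. But how can it be used in a distance pattern, for example (hello) (?:[a-zA-Zа-яА-Я]+ ){1,2}(beautiful) (?:[a-zA-Zа-яА-Я]+ ){2,4}(world)?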
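
For reference, a minimal sketch of the lookahead variant mentioned in the update, applied to a fixed phrase. The helper name `overlapping_spans` is made up for illustration. Because the lookahead itself consumes no characters, finditer moves on by one character after each hit, so overlapping occurrences are reported; the real span is read from group 1.

```python
import re

def overlapping_spans(phrase, text):
    # Wrap the phrase in a capturing group inside a lookahead:
    # the overall match is empty, group 1 holds the actual text.
    pattern = re.compile(u'(?=(%s))' % re.escape(phrase), flags=re.I | re.U)
    return [m.span(1) for m in pattern.finditer(text)]

print(overlapping_spans(u'hello hello', u'hello hello hello world'))
# [(0, 11), (6, 17)]
```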

Answer 1

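I think you can try the expression below instead of a regular expression. It does not look that pretty, but it may solve your problem:

Expression: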

 [pos for pos, char in enumerate(string) if string[pos:].find(pattern) == 0]

It returns a list of the positions at which the pattern occurs in the string.

In [43]: string = "hello very beautiful beautiful very big world world"
In [44]: pattern='hello'
In [45]: [pos for pos, char in enumerate(string) if string[pos:].find(pattern) == 0]
Out[45]: [0]
In [46]: pattern='very'
In [47]: [pos for pos, char in enumerate(string) if string[pos:].find(pattern) == 0]
Out[47]: [6, 31]
In [48]: pattern='world'
In [49]: [pos for pos, char in enumerate(string) if string[pos:].find(pattern) == 0]
Out[49]: [40, 46]
In [50]: pattern='very big'
In [51]: [pos for pos, char in enumerate(string) if string[pos:].find(pattern) == 0]
Out[51]: [31]
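
The list comprehension above slices the string at every position. A sketch of an equivalent that avoids the repeated slicing, assuming only literal (non-regex) patterns are needed, is shown below; the function name `find_all` is hypothetical. It restarts `str.find` one character after each hit, so overlapping occurrences are kept.

```python
def find_all(text, pattern):
    positions = []
    pos = text.find(pattern)
    while pos != -1:
        positions.append(pos)
        # Resume one character later (not after the match) to keep overlaps.
        pos = text.find(pattern, pos + 1)
    return positions

string = "hello very beautiful beautiful very big world world"
print(find_all(string, "very"))                              # [6, 31]
print(find_all("hello hello hello world", "hello hello"))    # [0, 6]
```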

Hope this helps.

Answer 2

Your question still does not make clear what you want to do, but I'll take a stab at it:

Regex to find repeats without consuming them:

([a-zA-Zа-яА-Я]+)(?= (\1))
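
A short sketch of how this repeat pattern might be used with finditer; the helper name `repeated_pairs` is an illustrative assumption. Only the first word is consumed, while the repeated word is checked inside the lookahead, so the second word of one pair can start the next pair. The span of the whole pair runs from the start of group 1 to the end of group 2.

```python
import re

# Group 1 is the word; the lookahead requires a space and the same word again.
repeat = re.compile(r'([a-zA-Zа-яА-Я]+)(?= (\1))')

def repeated_pairs(text):
    return [(m.start(1), m.end(2)) for m in repeat.finditer(text)]

print(repeated_pairs('hello hello hello world'))
# [(0, 11), (6, 17)]
```

Note that the pattern has no word boundaries, so it can also match the tail of one word followed by the same letters (for example "he" in "the he"); adding `\b` on both sides would be one way to restrict it to whole words.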

Regex to find hello, beautiful and world with a specific number of words in between:

(hello) (?:[a-zA-Zа-яА-Я]+ ){1,2}(beautiful) (?:[a-zA-Zа-яА-Я]+ ){2,4}(world)

Final update

What you want to do cannot be done entirely in regex in a single run.

It is easier to loop and execute different regexes:

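for i in range(1, 3):      # 1 or 2 words between hello and beautiful
    for j in range(2, 5):  # 2, 3 or 4 words between beautiful and world
        # raw strings keep \w intact; each (i, j) pair gives a separate regex
        regStr = r'(hello) (?:\w+ ){' + str(i) + r'}(beautiful) (?:\w+ ){' + str(j) + r'}(world)'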
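
The loop above only builds the pattern strings. One possible way to complete it is sketched below, assuming words are separated by single spaces. The function name `distance_matches` and its parameters are made up for illustration and are not part of the answer. Each fixed-count pattern is wrapped in a lookahead so that candidates starting at different positions are not skipped.

```python
import re

def distance_matches(text, words, gaps):
    """Return sorted (start, end) spans for every allowed gap combination.

    words: three keywords, e.g. ('hello', 'beautiful', 'world')
    gaps:  two (min, max) word-count ranges, e.g. ((1, 2), (2, 4))
    """
    first, second, third = (re.escape(w) for w in words)
    (lo1, hi1), (lo2, hi2) = gaps
    spans = []
    for i in range(lo1, hi1 + 1):
        for j in range(lo2, hi2 + 1):
            # Exact counts i and j: one regex per combination.
            inner = r'\b(%s) (?:\w+ ){%d}(%s) (?:\w+ ){%d}(%s)\b' % (
                first, i, second, j, third)
            # The lookahead consumes nothing, so every start position is tried.
            for m in re.finditer('(?=' + inner + ')', text, flags=re.I | re.U):
                spans.append((m.start(1), m.end(3)))
    return sorted(spans)

text = u'hello hello hello world hello very beautiful beautiful very big world world'
print(distance_matches(text, ('hello', 'beautiful', 'world'), ((1, 2), (2, 4))))
# [(24, 69), (24, 69), (24, 75), (24, 75)]
```

The same span appears twice when two different positions of the middle word lead to the same start and end, which is the behaviour the question asks for.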

Then check for the repeats separately using

([a-zA-Zа-яА-Я]+)(?= (\1))
