Regex for matching large pattern

I am new to regex. I am trying to match a pattern: it works when the pattern is short, but it gets stuck when the pattern is large (it looks like a catastrophic backtracking issue).

Below is my string:

world0 world1 world2 world3 world4 world5 world6 world7 world8 world9 world10
world11 world12 world13 world14 world15 world16 world17 world18 world19 world20
world21 world22 world23 world24 world25 world26 world27 world28 world29 world30
world31 world32 world33 world34 world35 world36 world37 world38 world39 world40
world41 world42 world43 world44 world45 world46 world47 world48 world49 world50
world51 world52 world53 world54 world55 world56 world57 world58 world59 world60
world61 world62 world63 world64 world65 world66 world67 world68
world69 world70 world71 world72 world73 world74 world75 world76 world77 world78
world79 world80 world81 world82 world83 world84 world85 world86 world87 world88
world89 world90 world91 world92 world93 world94 world95 world96 world97 world98
world99 world0 world1 world2 world3 world4 world5 world6 world7 world8 world9
world10 world11 world12 world13 world14 world15 world16 world17 world18 world19
world20 world21 world22 world23 world24 world25 world26 world27 world28 world29
world30 world31 world32 world33 world34 world35 world36 world37 world38 world39
world40 world41 world42 world43 world44 world45 world46 world47 world48 world49
world50 world51 world52 world53 world54 world55 world56 world57 world58 world59
world60 world61 world62 world63 world64 world65 world66 world67 world68 world69
world70 world71 world72 world73 world74 world75 world76 world77 world78 world79
world80 world81 world82 world83 world84 world85 world86 world87 world88 world89
world90 world91 world92 world93 world94 world95 world96 world97 world98 

Now my match pattern is a list of strings, say match_list. My expected output is the substrings of the text above that contain all the strings defined in match_list.

Small list = ["world0","world1", "world2"]

I tried the following pattern:

(?=((\b(?:world0|world1|world2)\b[\w\s]*?){3}))

The above works fine, and the matched output is what I expect:

[0-20]  `world0 world1 world2`

[7-796] `world1 world2 world3 world4 world5 world6 world7 world8 world9 world10
world11 world12 world13 world14 world15 world16 world17 world18 world19 world20
world21 world22 world23 world24 world25 world26 world27 world28 world29 world30
world31 world32 world33 world34 world35 world36 world37 world38 world39 world40
world41 world42 world43 world44 world45 world46 world47 world48 world49 world50
world51 world52 world53 world54 world55 world56 world57 world58 world59 world60
world61 world62 world63 world64 world65 world66 world67 world68 world69 world70
world71 world72 world73 world74 world75 world76 world77 world78 world79 world80
world81 world82 world83 world84 world85 world86 world87 world88 world89 
world90 world91 world92 world93 world94 world95 world96 world97 world98 world99 world0`

 [14-803] `world2 world3 world4 world5 world6 world7 world8 world9 world10 world11
 world12 world13 world14 world15 world16 world17 world18 world19 world20 world21
 world22 world23 world24 world25 world26 world27 world28 world29 world30 world31
 world32 world33 world34 world35 world36 world37 world38 world39 world40 world41
 world42 world43 world44 world45 world46 world47 world48 world49 world50 world51
 world52 world53 world54 world55 world56 world57 world58 world59 world60 world61
 world62 world63 world64 world65 world66 world67 world68 world69 world70 world71
 world72 world73 world74 world75 world76 world77 world78 world79 world80 world81
 world82 world83 world84 world85 world86 world87 world88 world89 world90 world91
 world92 world93 world94 world95 world96 world97 world98 world99 world0 world1`

[790-810]   `world0 world1 world2`

But for a large list = ['world0', 'world1', 'world2', 'world3', 'world4', 'world5', 'world6', 'world7', 'world8', 'world9', 'world10', 'world11', 'world12', 'world13', 'world14', 'world15', 'world16', 'world17', 'world18', 'world19', 'world20', 'world21', 'world22', 'world23', 'world24', 'world25', 'world26', 'world27', 'world28', 'world29', 'world30', 'world31', 'world32', 'world33', 'world34', 'world35', 'world36', 'world37', 'world38', 'world39', 'world40', 'world41', 'world42', 'world43', 'world44', 'world45', 'world46', 'world47', 'world48', 'world49']

I tried the following pattern:

(?=((\b(?:world0|world1|world2|world3|world4|world5|world6|world7|world8|world9|world10|world11|world12|world13|world14|world15|world16|world17|world18|world19|world20|world21|world22|world23|world24|world25|world26|world27|world28|world29|world30|world31|world32|world33|world34|world35|world36|world37|world38|world39|world40|world40|world41|world42|world43|world44|world45|world46|world47|world48|world49|world50)\b[\w\s]*?){49}))

It throws a catastrophic backtracking error. Could someone tell me what I am doing wrong, or what the best way to do this would be?

First, your pattern is wrong, since it also matches world0 world0 world0: the quantifier {3} only asks for three words from the alternation, not three different ones.
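A quick check with the standard re module illustrates the point. This is only a sketch: the sample string is made up for the demonstration, and the pattern is the small one from the question.

```python
import re

# The small pattern from the question: three words from the alternation,
# each optionally followed by filler, wrapped in a lookahead with a capture.
pattern = re.compile(r'(?=((\b(?:world0|world1|world2)\b[\w\s]*?){3}))')

# Hypothetical input that repeats a single word three times.
sample = "world0 world0 world0"

m = pattern.search(sample)
captured = m.group(1) if m else None
print(captured)
```

The capture is the whole sample string, even though world1 and world2 never occur in it.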

This problem can't be solved with regex alone. If I write a pattern (for the regex module) like:

word_list = ['world0', 'world1', 'world2']
p = regex.compile(r'''
    \m (\L<words>)
    \W++ (?>\w+\W+)*? (?!\g{-1})
    (\L<words>) 
    \W++ (*SKIP) (?>\w+\W+)*? (?!\g{-1}|\g{-2})
    (\L<words>) \M 
  ''', regex.VERBOSE, words=word_list)

for m in p.finditer(text, overlapped=True):
     print(m.group(0))

that searches for only three items in your example text, I get something complicated (and inefficient, even after some optimisation effort), hard to extend to more items, and likely to crash with more text or more items.

Another possible approach is to search only for the words from the list, and to yield text excerpts from a generator once all the words have been found:

import regex
from collections import deque

data = '''He moved on as he spoke, and the Dormouse followed him: the March Hare moved into the Dormouse’s place, and Alice rather unwillingly took the place of the March Hare. The Hatter was the only one who got any advantage from the change: and Alice was a good deal worse off than before, as the March Hare had just upset the milk-jug into his plate.
Alice did not wish to offend the Dormouse again, so she began very cautiously: `But I don’t understand. Where did they draw the treacle from?’
`You can draw water out of a water-well,’ said the Hatter; `so I should think you could draw treacle out of a treacle-well–eh, stupid?’
`But they were IN the well,’ Alice said to the Dormouse, not choosing to notice this last remark.
`Of course they were’, said the Dormouse; `–well in.’
This answer so confused poor Alice, that she let the Dormouse go on for some time without interrupting it.
`They were learning to draw,’ the Dormouse went on, yawning and rubbing its eyes, for it was getting very sleepy; `and they drew all manner of things–everything that begins with an M–‘
`Why with an M?’ said Alice.
`Why not?’ said the March Hare.
Alice was silent.
The Dormouse had closed its eyes by this time, and was going off into a doze; but, on being pinched by the Hatter, it woke up again with a little shriek, and went on: `–that begins with an M, such as mouse-traps, and the moon, and memory, and muchness– you know you say things are “much of a muchness”–did you ever see such a thing as a drawing of a muchness?’
`Really, now you ask me,’ said Alice, very much confused, `I don’t think–‘
`Then you shouldn’t talk,’ said the Hatter.'''

word_list = ('Dormouse', 'Hatter', 'Alice')

def match_gen(word_list, text):
    p = regex.compile(r'\m\L<words>\M', words=word_list)
    d = deque()
    occlist = [0]*len(word_list)   

    for m in p.finditer(text):
        windex = word_list.index(m.group(0))
        d.append((windex, m.start()))
        occlist[windex] += 1

        while not(0 in occlist):
            elt = d.popleft()
            occlist[elt[0]] -= 1
            yield [elt[1],m.end()],text[elt[1]:m.end()]

for x in match_gen(word_list, data):
    print(x)

The advantages are that there is no longer any risk of catastrophic backtracking, and that memory usage is low.

Note: I chose the regex module over the re module because it has handier features such as named lists, the overlapped flag, and the word boundaries \m and \M, but you can do the same with the re module (you need to use (?=(...)) for overlapped matches, \b instead of \m and \M, and '|'.join(word_list) to build the alternation).
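As a minimal sketch of the re-module variant, the alternation can be built like this. The word list and sample text are hypothetical, and wrapping each word in re.escape is an extra precaution that the note above does not mention.

```python
import re

# Hypothetical word list; note that 'world1' is a prefix of 'world10'.
word_list = ['world1', 'world10', 'world2']

# Build the alternation by hand and use \b in place of \m and \M.
alternation = '|'.join(re.escape(word) for word in word_list)
pattern = re.compile(r'\b(?:' + alternation + r')\b')

sample = 'world10 world1 world2 world20'
found = [m.group(0) for m in pattern.finditer(sample)]
print(found)
```

The trailing \b makes the engine backtrack from world1 to world10 when more digits follow, so the order of the words in the alternation does not matter here, and world20 is not reported as world2.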

Note2: If your word list is too long, you can use the same approach, but instead of using an alternation as the pattern (i.e. \L<words>), use only \w+ and check for each match whether it is in the list. You can replace the beginning of the previous code like this:

def match_gen(word_list, text):
    p = regex.compile(r'\w+')
    d = deque()
    occlist = [0]*len(word_list)   

    for m in filter(lambda x: x.group(0) in word_list, p.finditer(text)):
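For reference, a complete version of this variant might look like the sketch below. It is not the code above verbatim: it assumes the standard re module instead of regex, the name match_gen_words is made up, and a dictionary lookup stands in for the `in word_list` test and the `word_list.index` call so that each check is constant time. The sample text is hypothetical.

```python
import re
from collections import deque

def match_gen_words(word_list, text):
    """Yield ((start, end), excerpt) for windows of text containing every word."""
    # Assumption: map each word to its position for fast membership tests.
    index_of = {word: i for i, word in enumerate(word_list)}
    queue = deque()                 # (word index, start offset) of pending matches
    counts = [0] * len(word_list)   # occurrences of each word in the current window

    for m in re.finditer(r'\w+', text):
        windex = index_of.get(m.group(0))
        if windex is None:          # token is not in the list: skip it
            continue
        queue.append((windex, m.start()))
        counts[windex] += 1

        # While every word is present, emit the window, then drop its leftmost word.
        while 0 not in counts:
            left_index, left_start = queue.popleft()
            counts[left_index] -= 1
            yield (left_start, m.end()), text[left_start:m.end()]

words = ['world0', 'world1', 'world2']
sample = 'world0 world1 world2 world3 world0 world1 world2'
results = list(match_gen_words(words, sample))
for span, excerpt in results:
    print(span, excerpt)
```

Each token is visited once and no pattern ever has to backtrack, so the cost should grow roughly linearly with the text, whatever the length of the word list.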

It seems that, with your last pattern, you want to match every world whose numeric suffix is at most 50. So instead of

(?=((\b(?:world0|world1|world2|world3|world4|world5|world6|world7|world8|world9|world10|world11|world12|world13|world14|world15|world16|world17|world18|world19|world20|world21|world22|world23|world24|world25|world26|world27|world28|world29|world30|world31|world32|world33|world34|world35|world36|world37|world38|world39|world40|world40|world41|world42|world43|world44|world45|world46|world47|world48|world49|world50)\b[\w\s]*?){49}))

Why not the following (match all values from 0-49 or 50):

(?=((\b(?:world([1-4]?[0-9]|50))\b[\w\s]*?){3}))

And here is my attempt to clean up your regex, based on your description:

it should match sub-string from above that has all string defined in match_list string

import re
# Non-capturing group so findall returns whole matches; [1-4]?[0-9] covers 0-49.
pattern = r'\bworld(?:[1-4]?[0-9]|50)\b'
matches = re.findall(pattern, "world1 world2 world50 world60")
print(matches)  # ['world1', 'world2', 'world50']
