
How to find all possible regex matches in python?

I am trying to find all possible word/tag pairs, or other nested combinations, with Python and its regular expressions.

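import re

sent = '(NP (NNP Hoi) (NN Hallo) (NN Hey) (NNP (NN Ciao) (NN Adios)))'

def checkBinary(sentence):
    # Raw string so the backslashes reach the regex engine unchanged.
    # The class allows letters, digits, '-', whitespace and both parentheses,
    # so the greedy * swallows the whole sentence in one match.
    n = re.findall(r"\([A-Za-z-0-9\s\)\(]*\)", sentence)
    print(n)

checkBinary(sent)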

Output:
['(NP (NNP Hoi) (NN Hallo) (NN Hey) (NNP (NN Ciao) (NN Adios)))']

Looking for:

['(NP (NNP Hoi) (NN Hallo) (NN Hey) (NNP (NN Ciao) (NN Adios)))', 
 '(NNP Hoi)', 
 '(NN Hallo)',
 '(NN Hey)', 
 '(NNP (NN Ciao) (NN Adios))',
 '(NN Ciao)',
 '(NN Adios)']

I think the regex could find the nested parenthesized word/tag pairs as well, but it doesn't return them. How should I do this?

It's actually not possible to do this with regular expressions in the formal sense, because a regular expression describes a language defined by a regular grammar, which is recognized by a finite automaton (deterministic or nondeterministic) in which matching progress is represented by states. To match parentheses nested to arbitrary depth, the automaton would have to keep track of an unbounded number of open parentheses, which would require an infinite number of states.

To cope with that, we use what's called a pushdown automaton: a finite automaton extended with a stack, which recognizes the languages defined by context-free grammars.
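
The stack idea can be sketched directly in plain Python, without any grammar library. This is only a minimal illustration, not part of the original answer: the function name `balanced_groups` is made up for the example, and it assumes every parenthesis in the input is structural (no escaped or quoted parentheses).

```python
def balanced_groups(text):
    """Return every balanced parenthesized substring of text, ordered by opening position."""
    stack = []   # indices of '(' characters that are still open
    spans = []   # (start, end) slices of completed groups
    for i, ch in enumerate(text):
        if ch == '(':
            stack.append(i)
        elif ch == ')' and stack:
            # A ')' closes the most recently opened '(' (the top of the stack).
            start = stack.pop()
            spans.append((start, i + 1))
    spans.sort()  # outer groups start earlier, so they come first
    return [text[start:end] for start, end in spans]


sent = '(NP (NNP Hoi) (NN Hallo) (NN Hey) (NNP (NN Ciao) (NN Adios)))'
for group in balanced_groups(sent):
    print(group)
```

On the sentence from the question this prints the seven groups the asker is looking for, outermost first. Unmatched closing parentheses are simply ignored in this sketch.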

[Figure: Chomsky hierarchy]

So if your regex does not return the nested parentheses, it's because it expresses the following automaton, which has no way to track nesting depth and therefore consumes your whole input as a single match:

[Figure: regular expression visualization]


As a reference, please have a look at MIT's courses on the topic.

So one way to parse your string efficiently is to build a grammar for nested parentheses (pip install pyparsing first):

>>> import pyparsing
>>> strings = pyparsing.Word(pyparsing.alphanums)
>>> parens  = pyparsing.nestedExpr( '(', ')', content=strings)
>>> parens.parseString('(NP (NNP Hoi) (NN Hallo) (NN Hey) (NNP (NN Ciao) (NN Adios)))').asList()
[['NP', ['NNP', 'Hoi'], ['NN', 'Hallo'], ['NN', 'Hey'], ['NNP', ['NN', 'Ciao'], ['NN', 'Adios']]]]
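
pyparsing returns a nested list rather than the flat list of strings the question asks for. One possible way to get from one to the other is to walk the nested list and rebuild each subtree as a string. The sketch below is an illustration only: the helper names `to_bracketed` and `all_subtrees` are hypothetical, and the rebuilt strings use single spaces, so any unusual whitespace in the original input is not preserved.

```python
def to_bracketed(tree):
    """Rebuild the '(TAG child ...)' string for one nested-list subtree."""
    parts = [to_bracketed(p) if isinstance(p, list) else p for p in tree]
    return '(' + ' '.join(parts) + ')'


def all_subtrees(tree):
    """Yield the bracketed string of tree, then of every nested subtree (pre-order)."""
    yield to_bracketed(tree)
    for part in tree:
        if isinstance(part, list):
            yield from all_subtrees(part)


# The nested list printed by parens.parseString(...).asList() above.
parsed = [['NP', ['NNP', 'Hoi'], ['NN', 'Hallo'], ['NN', 'Hey'],
           ['NNP', ['NN', 'Ciao'], ['NN', 'Adios']]]]

result = list(all_subtrees(parsed[0]))
for item in result:
    print(item)
```

The outer list returned by asList() has one element per top-level group, which is why the walk starts at `parsed[0]`.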

NB: a few regular expression engines do implement nested parenthesis matching using a pushdown mechanism. The default Python re engine is not one of them, but an alternative engine called regex (pip install regex) can do recursive matching (which lets it match context-free constructs such as balanced parentheses); cf. this code snippet:

>>> import regex
>>> res = regex.search(r'(?<rec>\((?:[^()]++|(?&rec))*\))', '(NP (NNP Hoi) (NN Hallo) (NN Hey) (NNP (NN Ciao) (NN Adios)))')
>>> res.captures('rec')
['(NNP Hoi)', '(NN Hallo)', '(NN Hey)', '(NN Ciao)', '(NN Adios)', '(NNP (NN Ciao) (NN Adios))', '(NP (NNP Hoi) (NN Hallo) (NN Hey) (NNP (NN Ciao) (NN Adios)))']

Regular expressions used in modern languages DO NOT represent regular languages. zmo is right in saying that regular languages in language theory are represented by finite state automata, but the regular expressions used in modern languages rely on backtracking features, most clearly backreferences to capturing groups, that CANNOT be represented by the FSAs known from language theory. How can you represent a pattern like (\w+)\1 with a DFA or even an NFA?
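
As a small illustration of the backreference point, the standard re module accepts the pattern (\w+)\1 and uses it to recognize strings made of some word immediately followed by a copy of itself, which is a classic example of a non-regular language. The variable name `doubled` is just a label chosen for this sketch.

```python
import re

# (\w+)\1 matches a run of word characters immediately followed by a copy of itself.
doubled = re.compile(r'(\w+)\1')

print(bool(doubled.fullmatch('abcabc')))  # True: 'abc' followed by 'abc'
print(bool(doubled.fullmatch('abcabd')))  # False: the second half differs
```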

The regular expression you are looking for can be something like this (it only matches up to two levels of nesting):

(?=(\((?:[^\)\(]*\([^\)]*\)|[^\)\(])*?\)))

I tested this on http://regexhero.net/tester/

The matches are in the captured groups:

1: (NP (NNP Hoi) (NN Hallo) (NN Hey) (NNP (NN Ciao) (NN Adios))

1: (NNP Hoi)

1: (NN Hallo)

1: (NN Hey)

1: (NNP (NN Ciao) (NN Adios))

1: (NN Ciao)

1: (NN Adios)
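
The same lookahead pattern should also work with Python's built-in re module, since it only uses a capturing group inside a zero-width lookahead. The sketch below is an assumption-labelled port and was not part of the original test on regexhero; the variable names are made up for the example. Because the lookahead consumes no characters, re.findall can report overlapping groups, one per opening parenthesis.

```python
import re

sent = '(NP (NNP Hoi) (NN Hallo) (NN Hey) (NNP (NN Ciao) (NN Adios)))'

# The group sits inside a zero-width lookahead, so matches may overlap.
# The pattern only understands two levels of nesting.
two_level = re.compile(r'(?=(\((?:[^\)\(]*\([^\)]*\)|[^\)\(])*?\)))')

matches = two_level.findall(sent)
for m in matches:
    print(m)
```

Note that the outermost group in the question's sentence is three levels deep, so its match stops one closing parenthesis short, as in the first captured group listed above.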
