Create set of all possible matches for a given regex

I'm wondering how to find the set of all matches for a given regex that has a finite number of matches.

For example:

For all of these examples, assume the regex starts with ^ and ends with $.

`hello?` -> (hell, hello)
`[1-9][0-9]{0,3}` -> (1,2,3 ..., 9998, 9999)
`My (cat|dog) is awesome!` -> (My cat is awesome!, My dog is awesome!)
`1{1,10}` -> (1,11, ..., 111111111, 1111111111)
`1*` -> //error
`1+` -> //error
`(1|11){2}` -> (11,111,1111) //notice how it doesn't repeat any of the possibilities

I'd also be interested in a way of retrieving the count of unique solutions to the regex, or a way to determine whether the regex has a finite number of solutions.

It would be nice if the algorithm could parse any regex, but a powerful enough subset of regex syntax would be fine.

I'm interested in a PHP solution to this problem, but other languages would also be fine.

EDIT:

I've learned in my Formal Theory class about DFAs, which can be used to implement regexes (and other regular languages). If I could transform the regex into a DFA, the solution seems fairly straightforward to me, but that transformation seems rather tricky.

EDIT 2:

Thanks for all the suggestions. See my post about the public GitHub project I'm working on to "answer" this question.

The transformation from a regex to a DFA is pretty straightforward. The issue you'll run into there, though, is that the generated DFA can contain loops (e.g., for * or +), which makes it impossible to expand fully. Additionally, a bounded repetition such as {n,n} can't be represented cleanly in a DFA: a DFA has no "memory" of how many times it has looped, so the repeated part has to be unrolled into separate copies of its states.

What a solution to this problem boils down to is building a function that tokenizes and parses a regular expression, then returns an array of all possible matches. Using recursion here will help you a lot.

A starting point, in pseudocode, might look like:

to GenerateSolutionsFor(regex):
    solutions = [""]
    for token in TokenizeRegex(regex):
        if token.isConstantString:
            # extend every partial solution with the literal text
            solutions = [sol + token.string for sol in solutions]
        else if token.isLeftParen:
            subregex = get content until matching right paren
            subsols = GenerateSolutionsFor(subregex)
            # cross product: every partial solution with every sub-solution
            solutions = [sol + subsol for sol in solutions for subsol in subsols]
        else if token.isVerticalBar:
            # alternation: what was built so far, or whatever the rest matches
            return solutions + GenerateSolutionsFor(rest of the regex)
        else if token.isLeftBrace:
            ...
    return solutions
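
The pseudocode above can be turned into something runnable along the following lines. This is only a minimal sketch in Python rather than PHP, and it makes several assumptions that are not in the question: the function names (`expand`, `repeat`, `char_class`) are made up for illustration, the anchors ^ and $ are implicit and must be left off, escapes are not handled, and only literals, groups, `|`, `?`, `{n}`, `{n,m}` and simple character classes are supported. Using sets for the intermediate results is what removes duplicates such as the two ways of producing 111 from `(1|11){2}`.

```python
def repeat(strings, low, high):
    """Concatenate `strings` with itself between `low` and `high` times."""
    results, current = set(), {""}
    for count in range(high + 1):
        if count >= low:
            results |= current
        if count < high:
            current = {a + b for a in current for b in strings}
    return results


def char_class(body):
    """Expand the inside of [...] with plain characters and a-z style ranges."""
    chars, i = set(), 0
    while i < len(body):
        if i + 2 < len(body) and body[i + 1] == "-":
            chars.update(chr(c) for c in range(ord(body[i]), ord(body[i + 2]) + 1))
            i += 3
        else:
            chars.add(body[i])
            i += 1
    return chars


def expand(regex):
    """Return every string matched by a small, star-free regex subset."""
    pos = 0

    def peek():
        return regex[pos] if pos < len(regex) else None

    def parse_alternation():
        nonlocal pos
        results = parse_sequence()
        while peek() == "|":
            pos += 1
            results |= parse_sequence()
        return results

    def parse_sequence():
        results = {""}
        while peek() not in (None, "|", ")"):
            piece = parse_repeated_atom()
            results = {left + right for left in results for right in piece}
        return results

    def parse_repeated_atom():
        nonlocal pos
        atom = parse_atom()
        ch = peek()
        if ch in ("*", "+"):
            raise ValueError("unbounded repetition: infinitely many matches")
        if ch == "?":
            pos += 1
            return atom | {""}
        if ch == "{":
            close = regex.index("}", pos)
            bounds = regex[pos + 1:close].split(",")
            if bounds[-1] == "":
                raise ValueError("open-ended {n,}: infinitely many matches")
            pos = close + 1
            return repeat(atom, int(bounds[0]), int(bounds[-1]))
        return atom

    def parse_atom():
        nonlocal pos
        ch = peek()
        if ch in ("*", "+", "?", "{"):
            raise ValueError("nothing to repeat at position %d" % pos)
        pos += 1
        if ch == "(":
            inner = parse_alternation()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return inner
        if ch == "[":
            close = regex.index("]", pos)
            body = regex[pos:close]
            pos = close + 1
            return char_class(body)
        return {ch}

    results = parse_alternation()
    if pos != len(regex):
        raise ValueError("unbalanced ')' at position %d" % pos)
    # Shortest matches first, then alphabetical.
    return sorted(results, key=lambda s: (len(s), s))


print(expand("hello?"))
print(expand("My (cat|dog) is awesome!"))
print(expand("(1|11){2}"))
```

Because the result is a duplicate-free list, `len(expand(regex))` also gives the number of unique solutions, and the `ValueError` raised for `*`, `+` and `{n,}` corresponds to the `//error` cases in the question.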

"I'm wondering how to find a set of all matches to a given regex with a finite number of matches."

Because you're only considering regular expressions denoting finite languages, you're actually considering a subset of the regular expressions over an alphabet. In particular, you're not dealing with regular expressions constructed using the Kleene star operator. This suggests a simple recursive algorithm for constructing the set of strings denoted by a Kleene-star-free regular expression over an alphabet Σ.

LANG(a)     = {a} for all a ∈ Σ
LANG(x ∪ y) = LANG(x) ∪ LANG(y)
LANG(xy)    = {vw : v ∈ LANG(x) ∧ w ∈ LANG(y)}

Consider a regular expression such as a(b ∪ c)d. This is precisely the structure of your cats and dogs example. Executing the algorithm will correctly determine the language denoted by the regular expression:

LANG(a((b ∪ c)d)) = {xy : x ∈ LANG(a) ∧ y ∈ LANG((b ∪ c)d)}
                  = {xy : x ∈ {a} ∧ y ∈ {vw : v ∈ LANG(b ∪ c) ∧ w ∈ LANG(d)}}
                  = {ay : y ∈ {vw : v ∈ (LANG(b) ∪ LANG(c)) ∧ w ∈ {d}}}
                  = {ay : y ∈ {vd : v ∈ {b} ∪ {c}}}
                  = {ay : y ∈ {vd : v ∈ {b,c}}}
                  = {ay : y ∈ {bd, cd}}
                  = {abd, acd}
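
The three LANG rules translate almost line for line into code. The following is a minimal sketch in Python; the tuple-based syntax tree and the node names `"sym"`, `"union"` and `"concat"` are assumptions made for this illustration, and turning a regex string into such a tree is left out.

```python
def lang(node):
    """Return the finite set of strings denoted by a star-free expression tree."""
    kind = node[0]
    if kind == "sym":       # LANG(a) = {a}
        return {node[1]}
    if kind == "union":     # LANG(x ∪ y) = LANG(x) ∪ LANG(y)
        return lang(node[1]) | lang(node[2])
    if kind == "concat":    # LANG(xy) = {vw : v ∈ LANG(x) ∧ w ∈ LANG(y)}
        return {v + w for v in lang(node[1]) for w in lang(node[2])}
    raise ValueError("unknown node kind: %r" % (kind,))


# a((b ∪ c)d), the expression from the derivation above
expr = ("concat", ("sym", "a"),
        ("concat", ("union", ("sym", "b"), ("sym", "c")), ("sym", "d")))
print(sorted(lang(expr)))  # ['abd', 'acd']
```

Since the result is a set, `len(lang(expr))` is the number of unique matches.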

You also ask whether there is an algorithm that determines whether a regular language is finite. The algorithm consists of constructing the deterministic finite automaton accepting the language, then determining whether the transition graph contains a walk from the start state to a final state that passes through a cycle. The language is infinite exactly when such a walk exists. Note that the subset of regular expressions constructed without Kleene star denotes only finite languages. Because the union and the concatenation of finite sets are finite, this follows by easy induction.
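
The finiteness test described above can be sketched as follows, assuming the DFA has already been built. The representation is an assumption made for this example: `transitions` maps each state to a dict from symbol to next state, and missing entries stand for a dead state. The check keeps only the states that are reachable from the start state and can still reach an accepting state, then looks for a cycle among them.

```python
def is_finite(transitions, start, accepting):
    """Return True if the DFA accepts only finitely many strings."""
    succ = {state: set(edges.values()) for state, edges in transitions.items()}
    pred = {}
    for state, targets in succ.items():
        for target in targets:
            pred.setdefault(target, set()).add(state)

    def reachable(sources, graph):
        seen, stack = set(sources), list(sources)
        while stack:
            state = stack.pop()
            for nxt in graph.get(state, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    # States that lie on at least one walk from the start to an accepting state.
    useful = reachable([start], succ) & reachable(accepting, pred)

    # Depth-first search with three colours; a grey successor closes a cycle.
    WHITE, GREY, BLACK = 0, 1, 2
    color = {state: WHITE for state in useful}

    def has_cycle(state):
        color[state] = GREY
        for nxt in succ.get(state, ()):
            if nxt not in useful:
                continue
            if color[nxt] == GREY or (color[nxt] == WHITE and has_cycle(nxt)):
                return True
        color[state] = BLACK
        return False

    return not any(color[s] == WHITE and has_cycle(s) for s in useful)


# hello? as a DFA: states 0..5, accepting after "hell" and after "hello"
hello = {0: {"h": 1}, 1: {"e": 2}, 2: {"l": 3}, 3: {"l": 4}, 4: {"o": 5}}
print(is_finite(hello, 0, {4, 5}))          # True

# 1+ as a DFA: the loop on state 1 makes the language infinite
one_plus = {0: {"1": 1}, 1: {"1": 1}}
print(is_finite(one_plus, 0, {1}))          # False
```

A cycle that sits only in a non-accepting trap state is ignored, because such a state is not on any walk to a final state.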

This probably doesn't answer all your questions or needs, but maybe it's a good starting point. A while ago I was searching for a way to auto-generate data that matches a regexp, and I found the Perl modules Parse::RandGen and Parse::RandGen::RegExp, which worked impressively well for my needs:

http://metacpan.org/pod/Parse::RandGen

You might want to look at this regex library, which parses a regex syntax (although slightly different from the Perl standard) and can construct a DFA from it: http://www.brics.dk/automaton/

I have begun working on a solution on GitHub. It can already lex most examples and give the solution set for finite regexes.

It currently passes the following unit tests.

<?php

class RegexCompiler_Tests_MatchTest extends PHPUnit_Framework_TestCase
{

    function dataProviderForTestSimpleRead()
    {
        return array(
            array( "^ab$", array( "ab" ) ),
            array( "^(ab)$", array( "ab" ) ),
            array( "^(ab|ba)$", array( "ab", "ba" ) ),
            array( "^(ab|(b|c)a)$", array( "ab", "ba", "ca" ) ),
            array( "^(ab|ba){0,2}$", array( "", "ab", "ba", "abab", "abba", "baab", "baba" ) ),
            array( "^(ab|ba){1,2}$", array( "ab", "ba", "abab", "abba", "baab", "baba" ) ),
            array( "^(ab|ba){2}$", array( "abab", "abba", "baab", "baba" ) ),
            array( "^hello?$", array( "hell", "hello" ) ),
            array( "^(0|1){3}$", array( "000", "001", "010", "011", "100", "101", "110", "111" ) ),
            array( "^[1-9][0-9]{0,1}$", array_map( function( $input ) { return (string)$input; }, range( 1, 99 ) ) ),
            array( '^\n$', array( "\n" ) ),
            array( '^\r$', array( "\r" ) ),
            array( '^\t$', array( "\t" ) ),
            array( '^[\\\\\\]a\\-]$', array( "\\", "]", "a", "-" ) ), //the regex is actually '^[\\\]a\-]$' after PHP string parsing
            array( '^[\\n-\\r]$', array( chr( 10 ), chr( 11 ), chr( 12 ), chr( 13 ) ) ),
        );
    }

    /**
     * @dataProvider dataProviderForTestSimpleRead
     */

    function testSimpleRead( $regex_string, $expected_matches_array )
    {
        $lexer = new RegexCompiler_Lexer();
        $actual_matches_array = $lexer->lex( $regex_string )->getMatches();
        sort( $actual_matches_array );
        sort( $expected_matches_array );
        $this->assertSame( $expected_matches_array, $actual_matches_array );
    }

}

?>

I would like to build a MatchIterator class that could handle infinite lists, as well as one that would randomly generate matches from the regex. I'd also like to look into building a regex from a match set as a way of optimizing lookups or compressing data.
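
One way an iterator over an infinite match list could work is to enumerate matches lazily, shortest first. The following is a rough sketch in Python, not the design of the GitHub project: the tuple-based syntax tree, the node names and the functions `strings_of_length` and `iter_matches` are all assumptions made for illustration. The iterator never ends by itself unless `max_length` is given, so the caller decides how many matches to take.

```python
from functools import lru_cache
from itertools import count, islice


@lru_cache(maxsize=None)
def strings_of_length(node, n):
    """All strings of exactly length n denoted by the expression tree `node`."""
    kind = node[0]
    if kind == "sym":
        return frozenset({node[1]}) if len(node[1]) == n else frozenset()
    if kind == "union":
        return strings_of_length(node[1], n) | strings_of_length(node[2], n)
    if kind == "concat":
        # Split the length n between the left and the right part.
        return frozenset(v + w
                         for k in range(n + 1)
                         for v in strings_of_length(node[1], k)
                         for w in strings_of_length(node[2], n - k))
    if kind == "star":
        if n == 0:
            return frozenset({""})
        # Take a non-empty first repetition, then recurse on what is left.
        return frozenset(v + w
                         for k in range(1, n + 1)
                         for v in strings_of_length(node[1], k)
                         for w in strings_of_length(node, n - k))
    raise ValueError("unknown node kind: %r" % (kind,))


def iter_matches(node, max_length=None):
    """Yield matches in order of length, then alphabetically, without repeats."""
    lengths = count() if max_length is None else range(max_length + 1)
    for n in lengths:
        yield from sorted(strings_of_length(node, n))


ones = ("star", ("sym", "1"))                      # the regex 1*
print(list(islice(iter_matches(ones), 4)))          # ['', '1', '11', '111']
```

Grouping the strings by length keeps each step finite even when the language is not, and the frozensets prevent the same match from being produced twice.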
