How to evaluate constraints using regular expressions? (php, regex)

So, let's say I want to accept strings as follows:
SomeColumn IN||<||>||= [123, 'hello', "wassup"]||123||'hello'||"yay!"
For example: MyValue IN ['value', 123] or MyInt > 123. I think you get the idea. Now, what's bothering me is how to phrase this in a regex. I'm using PHP, and this is what I'm doing right now:

        $temp = explode(';', $constraints);
        $matches = array();
        foreach ($temp as $condition) {
            preg_match('/(.+)[\t| ]+(IN|<|=|>|!)[\t| ]+([0-9]+|[.+]|.+)/', $condition, $matches[]);
        }
        foreach ($matches as $match) {
            if ($match[2] == 'IN') {
                preg_match('/(?:([0-9]+|".+"|\'.+\'))/', substr($match[3], 1, -1), $tempm);
                print_r($tempm);
            }
        }
I'd really appreciate any help here; my regex skills are horrible.

I assume your input looks similar to this:

$string = 'SomeColumn IN [123, \'hello\', "wassup"];SomeColumn < 123;SomeColumn = \'hello\';SomeColumn > 123;SomeColumn = "yay!";SomeColumn = [123, \'hello\', "wassup"]';

If you use preg_match_all, there is no need for explode or to build the matches yourself. Note that the resulting two-dimensional array will have its dimensions switched (it is grouped by capturing group rather than by match), but that is often desirable. Here is the code:

preg_match_all('/(\w+)[\t ]+(IN|<|>|=|!)[\t ]+((\'[^\']*\'|"[^"]*"|\d+)|\[[\t ]*(?4)(?:[\t ]*,[\t ]*(?4))*[\t ]*\])/', $string, $matches);

$statements = $matches[0];
$columns = $matches[1];
$operators = $matches[2];
$values = $matches[3];
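
To make the "switched dimensions" concrete, here is a minimal sketch on a shortened input (the two-statement string is an assumption for illustration). It runs the same regex twice: once with the default PREG_PATTERN_ORDER and once with PREG_SET_ORDER, which groups the result by match instead.

```php
<?php
// Shortened sample input (assumed for illustration); same regex as above.
$string = 'SomeColumn IN [123, \'hello\'];MyInt > 123';
$pattern = '/(\w+)[\t ]+(IN|<|>|=|!)[\t ]+((\'[^\']*\'|"[^"]*"|\d+)|\[[\t ]*(?4)(?:[\t ]*,[\t ]*(?4))*[\t ]*\])/';

// Default (PREG_PATTERN_ORDER): first index is the capturing group,
// so $byGroup[1] lists the column of every statement.
preg_match_all($pattern, $string, $byGroup);

// PREG_SET_ORDER: first index is the match,
// so $byMatch[1] holds everything belonging to the second statement.
preg_match_all($pattern, $string, $byMatch, PREG_SET_ORDER);

print_r($byGroup[1]);
print_r($byMatch[1]);
```

Which layout is more convenient depends on whether you process the input statement by statement or column by column.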

There will also be a $matches[4], but it does not really have a meaning; it is only used inside the regular expression. First, a few things you did wrong in your attempt:

  • (.+) will consume as much as possible, and any character. So if you have something inside a string value that looks like IN 13, your first repetition might consume everything up to there and return it as the column. It also allows whitespace and = inside column names. There are two ways around this: either make the repetition "ungreedy" by appending ?, or, even better, restrict the allowed characters so you cannot go past the desired delimiter. In my regex I only allow letters, digits and underscores (\w) for column identifiers.
  • [\t| ] mixes up two concepts: alternation and character classes. What this does is "match a tab, a pipe or a space". In character classes you simply write all characters without delimiting them. Alternatively you could have written (\t| ), which would be equivalent in this case.
  • [.+] I don't know what you were trying to accomplish with this, but it matches either a literal . or a literal +. Again, it might be useful to restrict the allowed characters and to check for correct matching of quotes (to avoid 'some string").
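
The three points above can be checked with a short sketch. The input string and variable names are made up for illustration; the first pattern is a simplified version of the one in the question.

```php
<?php
// 1. Greedy (.+) grabs as much as it can, so the "column" ends up
//    containing part of the quoted value.
$input = "Name = 'x IN 13'";
preg_match('/(.+)[\t ]+(IN|<|=|>|!)[\t ]+(.+)/', $input, $greedy);
preg_match('/(\w+)[\t ]+(IN|<|=|>|!)[\t ]+(.+)/', $input, $restricted);
print_r($greedy);      // column is "Name = 'x", operator is "IN"
print_r($restricted);  // column is "Name", operator is "="

// 2. Inside a character class, | is a literal pipe, not alternation.
$pipeInClass = preg_match('/^[\t| ]$/', '|');   // 1
$pipeNoClass = preg_match('/^[\t ]$/', '|');    // 0

// 3. [.+] matches exactly one character: a literal dot or a literal plus.
$plusMatches = preg_match('/^[.+]$/', '+');     // 1
$textMatches = preg_match('/^[.+]$/', 'abc');   // 0
```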

Now for an explanation of my own regex (you can copy this into your code as well, it will work just fine; plus you have the explanation as comments in your code):

preg_match_all('/
    (\w+)           # match an identifier and capture in $1
    [\t ]+          # one or more tabs or spaces
    (IN|<|>|=|!)    # the operator (capture in $2)
    [\t ]+          # one or more tabs or spaces
    (               # start of capturing group $3 (the value)
        (           # start of subpattern for single-valued literals (capturing group $4)
            \'      # literal quote
            [^\']*  # arbitrarily many non-quote characters, to avoid going past the end of the string
            \'      # literal quote
        |           # OR
            "[^"]*" # equivalent for double-quotes
        |           # OR
            \d+     # a number
        )           # end of subpattern for single-valued literals
    |               # OR (arrays follow)
        \[          # literal [
        [\t ]*      # zero or more tabs or spaces
        (?4)        # reuse subpattern no. 4 (any single-valued literal)
        (?:         # start non-capturing subpattern for further array elements
            [\t ]*  # zero or more tabs or spaces
            ,       # a literal comma
            [\t ]*  # zero or more tabs or spaces
            (?4)    # reuse subpattern no. 4 (any single-valued literal)
        )*          # end of additional array element; repeat zero or more times
        [\t ]*      # zero or more tabs or spaces
        \]          # literal ]
    )               # end of capturing group $3
    /x',
    $string,
    $matches);

This makes use of PCRE's recursion feature, where you can reuse a subpattern (or the whole regular expression) with (?n) (where n is just the number you would also use for a backreference).
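
The difference between reusing a subpattern and a backreference is easy to miss, so here is a small sketch with made-up patterns: (?1) runs the pattern of group 1 again, while \1 demands the exact text that group 1 matched.

```php
<?php
// (?1) re-runs the pattern of group 1; \1 requires the same text again.
$reuse   = '/^(\d+)-(?1)$/';
$backref = '/^(\d+)-\1$/';

var_dump(preg_match($reuse, '12-345'));    // int(1): any digits on both sides
var_dump(preg_match($backref, '12-345'));  // int(0): second part would have to be "12"
var_dump(preg_match($backref, '12-12'));   // int(1)
```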

I can think of three major things that could be improved with this regex:

  • It does not allow for floating-point numbers.
  • It does not allow for escaped quotes (if your value is 'don\'t do this', it would only capture 'don\'). This can be solved using a negative lookbehind.
  • It does not allow for empty arrays as values (this could easily be solved by wrapping all parameters in a subpattern and making it optional with ?).

I included none of these, because I was not sure whether they apply to your problem, and I thought the regex was already complex enough to present here.
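
For readers who do need them, the following is one possible sketch of all three extensions, not a tested production pattern. Two things differ from the suggestions above and are my assumptions: escaped quotes are handled with an "escaped pair" alternative (\\.) instead of a negative lookbehind, and a decimal number is taken to be digits, a dot, and more digits. The pattern is written as a nowdoc with ~ as the delimiter so that no PHP-level escaping is needed.

```php
<?php
// Assumption: a backslash escapes whatever character follows it inside quotes.
$pattern = <<<'RE'
~
(\w+)                          # column name ($1)
[\t ]+
(IN|<|>|=|!)                   # operator ($2)
[\t ]+
(                              # value ($3)
    (                          # single literal ($4)
        '(?:[^'\\]|\\.)*'      # single-quoted, backslash escapes allowed
    |
        "(?:[^"\\]|\\.)*"      # double-quoted, same rule
    |
        \d+(?:\.\d+)?          # integer or simple decimal such as 12.5
    )
|
    \[ [\t ]*                  # array opening
    (?: (?4) (?: [\t ]* , [\t ]* (?4) )* [\t ]* )?   # optional element list
    \]
)
~x
RE;

// Sample input (made up); nowdoc keeps the backslash as-is.
$input = <<<'TXT'
Name = 'don\'t do this';Price > 12.5;Tags IN []
TXT;

preg_match_all($pattern, $input, $m);
print_r($m[3]);
```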

Usually regular expressions are not powerful enough to do proper language parsing anyway. It is generally better to write your own parser.
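
As a rough idea of what "your own parser" could look like for this small grammar, here is a minimal sketch. The function name parseConstraint and its return shape are hypothetical, and the parser is deliberately lenient: it skips characters it does not recognise and does not insist on commas between array elements.

```php
<?php
// Hypothetical helper, not from the answer: parse one constraint such as
// "MyValue IN ['value', 123]" into [column, operator, value or list of values].
function parseConstraint(string $text): array
{
    // Step 1: tokenize into quoted strings, words/numbers and punctuation.
    preg_match_all('/\'[^\']*\'|"[^"]*"|\w+|[<>=!\[\],]/', $text, $m);
    $tokens = $m[0];

    // Step 2: the first two tokens must be the column and a known operator.
    $column = array_shift($tokens);
    $operator = array_shift($tokens);
    if ($column === null || !in_array($operator, ['IN', '<', '>', '=', '!'], true)) {
        throw new InvalidArgumentException("Expected column and operator in: $text");
    }

    // Step 3a: a single literal value.
    if (($tokens[0] ?? null) !== '[') {
        if (count($tokens) !== 1) {
            throw new InvalidArgumentException("Expected exactly one value in: $text");
        }
        return [$column, $operator, $tokens[0]];
    }

    // Step 3b: a bracketed list of literals.
    array_shift($tokens);                        // drop '['
    $values = [];
    while ($tokens !== [] && $tokens[0] !== ']') {
        $values[] = array_shift($tokens);        // one element
        if ($tokens !== [] && $tokens[0] === ',') {
            array_shift($tokens);                // separator
        }
    }
    if ($tokens !== [']']) {
        throw new InvalidArgumentException("Unterminated or trailing input in: $text");
    }
    return [$column, $operator, $values];
}

print_r(parseConstraint("MyValue IN ['value', 123]"));
print_r(parseConstraint('MyInt > 123'));
```

The advantage over a single large regex is that each step can report its own error, and features such as escaped quotes or nested arrays can be added to the tokenizer or the loop without touching the rest.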

And since you said your regex skills are horrible... while regular expressions seem like a lot of black magic due to their uncommon syntax, they are not that hard to understand if you take the time once to get your head around their basic concepts. I can recommend this tutorial. It really takes you all the way through!
