How to efficiently match a string against lots of regular expressions

I want to efficiently match a string against a number of regular expressions to determine what the string represents.

^[0-9]{1}$         if string matches it is of type 1
^[a-x]{300}$       if string matches it is of type 2
...                ...

Iterating over a collection containing all of the regular expressions every time I want to match a string is far too slow for me.

Is there a more efficient way? Maybe I can compile these regexps into one big one? Maybe something that works like Google Suggestions, analysing letter after letter?

In my project I am using PHP/MySQL, but I would be thankful for a clue in any language.

Edit: Matching a string will be a very frequent operation, and the string values will vary.

What you could do, if possible, is group your regexes together and determine which group a string belongs to.

For instance, if a string doesn't match \d, you know there is no digit in it and you can skip all regexes that require one. So, for instance, instead of matching against 300+ regexes, you can narrow that down to just 25.
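
The grouping idea above can be sketched in JavaScript as follows. This is only an illustration under assumptions: the rule list, the `needsDigit` flag, the third pattern, and the `classifyByGroup` function are hypothetical names that do not come from the question, and real groups would depend on what your own patterns have in common.

```javascript
// Hypothetical rule set: each rule has a type label and a pattern.
// `needsDigit` marks rules that can only match strings containing a digit.
const rules = [
  { type: 1, pattern: /^[0-9]$/, needsDigit: true },
  { type: 2, pattern: /^[a-x]{300}$/, needsDigit: false },
  { type: 3, pattern: /^[a-z]+[0-9]{2}$/, needsDigit: true }, // made-up example
];

// Precompute the subset that is still worth trying when no digit is present.
const rulesWithoutDigit = rules.filter((r) => !r.needsDigit);

// Returns the type of the first matching rule, or null if none match.
function classifyByGroup(input) {
  // One cheap test: without a digit, every rule that needs one can be skipped.
  const candidates = /\d/.test(input) ? rules : rulesWithoutDigit;
  const hit = candidates.find((r) => r.pattern.test(input));
  return hit ? hit.type : null;
}
```

Note that a string containing a digit still has to be tried against every rule here, because a rule that does not require a digit may still allow one. More prefilters (length, first character, and so on) would narrow the candidates further.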

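You can combine your regexes into one like this:

^(?:([0-9])|([a-x]{300}))$

The non-capturing group (?:...) makes ^ and $ apply to every alternative; without it, ^ binds only to the first alternative and $ only to the last. Later, if you get more regexes, you can extend it:

^(?:([0-9])|([a-x]{300})|([x-z]{1,5})|([ab]{2,})|...)$

Then use this code:

$input=...
// The pattern is anchored to the whole string, so one preg_match call is enough.
if (preg_match('#^(?:([0-9])|([a-x]{300}))$#', $input, $matches)) {
    // Groups that did not take part in the match are empty strings or missing.
    if (isset($matches[1]) && $matches[1] !== '') {
       // type 1
    } elseif (isset($matches[2]) && $matches[2] !== '') {
       // type 2
    }
    // and so on...
}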
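
For comparison, here is a minimal JavaScript sketch of the same single-pattern approach, built from a table instead of written by hand. The `patterns` table, the type names, and the `classifyCombined` function are hypothetical; named groups are used so that a pattern containing its own parentheses does not shift the group numbering.

```javascript
// Hypothetical table mapping a type name to its pattern source (no anchors).
const patterns = {
  type1: "[0-9]",
  type2: "[a-x]{300}",
  type3: "[x-z]{1,5}",
  type4: "[ab]{2,}",
};

// Build ^(?:(?<type1>[0-9])|(?<type2>[a-x]{300})|...)$ once and reuse it.
const combined = new RegExp(
  "^(?:" +
    Object.entries(patterns)
      .map(([name, source]) => `(?<${name}>${source})`)
      .join("|") +
    ")$"
);

// Returns the name of the alternative that matched, or null if none did.
function classifyCombined(input) {
  const match = combined.exec(input);
  if (!match) return null;
  // Exactly one named group is defined after a successful match.
  for (const [name, value] of Object.entries(match.groups)) {
    if (value !== undefined) return name;
  }
  return null;
}
```

Alternatives are tried in the order they are listed, so when a string could satisfy several patterns the earliest one wins. Whether one large pattern is actually faster than a loop over small ones depends on the regex engine, so it is worth measuring with your real patterns.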

Since the regexes are going to be changing, I don't think you can get a generic answer: both your regexes and the way you handle them will need to evolve. For now, if you're looking to optimize your script, test for known substrings with something like indexOf before evaluating a regex, to lighten the regex load.

For instance, if you have 4 strings:

  • asdfsdfkjslkdujflkj2lkjsdlkf2lkja
  • 100010010100111010100101001001011
  • 101032021309420940389579873987113
  • asdfkajhslkdjhflkjshdlfkjhalksjdf

Each string belongs to a "type" as you've described it, so you could do:

// type 1 contains only 0s and 1s
// type 2 must contain a "2"
// type 3 contains only letters

var arr = [
    "asdfsdfkjslkdujflkj2lkjsdlkf2lkja",
    "100010010100111010100101001001011",
    "101032021309420940389579873987113",
    "asdfkajhslkdjhflkjshdlfkjhalksjdf"
    ];

for (var i = 0; i < arr.length; i++)
{
    var s = arr[i];

    // indexOf returns -1 when the substring is absent; 0 is a valid position.
    if (s.indexOf('2') !== -1)
    {
        // type 2
    }
    else if (s.indexOf('0') !== -1)
    {
        // Only run the regex once the cheap check has passed.
        if (/^[01]+$/.test(s))
        {
            // type 1
        }
        // otherwise ignore
    }
    else if (/^[a-z]+$/i.test(s))
    {
        // type 3
    }
}

See it in action here: http://jsfiddle.net/remus/44MdX/
