
How to effectively match a string with lots of regular expressions

I want to efficiently match a string against a number of regular expressions to determine what the string represents.

^[0-9]{1}$         if string matches it is of type 1
^[a-x]{300}$       if string matches it is of type 2
...                ...

Iterating over a collection containing all of the regular expressions every time I want to match a string is too expensive.

Is there a more efficient way? Maybe I can compile these regexps into one big one? Maybe something that works like Google Suggestions, analysing the string letter by letter?

In my project I am using PHP/MySQL, but I would be grateful for a hint in any language.

Edit: The matching operation will be very frequent, and the string values will vary.

If possible, group your regexes together and first determine which group a string belongs to.

For instance, if a string doesn't match \d, you know there is no digit in it and you can skip all regexes that require one. So instead of matching against 300+ regexes, you might narrow that down to, say, 25.
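The grouping idea above can be sketched roughly as follows. This is a minimal illustration, not a tuned implementation: the `groups` table, the `classifyGrouped` function, and the "type 4" rule are all hypothetical names and patterns made up for the example.

```javascript
// Hypothetical rule table: each group pairs a cheap precheck with the regexes it guards.
const groups = [
  {
    precheck: (s) => /\d/.test(s), // no digit: skip every rule in this group
    rules: [
      { type: 1, re: /^[0-9]$/ },
      { type: 4, re: /^[0-9]{4}-[0-9]{2}$/ }, // made-up rule, for illustration only
    ],
  },
  {
    precheck: (s) => /[a-x]/.test(s), // no letter in a-x: skip this group
    rules: [{ type: 2, re: /^[a-x]{300}$/ }],
  },
];

function classifyGrouped(s) {
  for (const group of groups) {
    if (!group.precheck(s)) continue; // none of this group's regexes can match
    for (const rule of group.rules) {
      if (rule.re.test(s)) return rule.type;
    }
  }
  return null; // no rule matched
}
```

Whether this pays off depends on how selective the prechecks are. If most strings pass most prechecks, the extra tests only add work.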

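You can combine your regexes like this (note the non-capturing group, so that ^ and $ apply to every alternative and not just the first and last):

^(?:([0-9])|([a-x]{300}))$

Later, if you add more regexes, you can do this:

^(?:([0-9])|([a-x]{300})|([x-z]{1,5})|([ab]{2,})...)$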

Then use this code:

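$input = ...; // the string to classify

// One match is enough; the group that captured something tells you the type.
if (preg_match('#^(?:([0-9])|([a-x]{300}))$#', $input, $m)) {
    // Compare against '' rather than using empty(), because empty('0') is true.
    if (isset($m[1]) && $m[1] !== '') {
        // type 1
    } elseif (isset($m[2]) && $m[2] !== '') {
        // type 2
    }
    // and so on...
}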
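For comparison, here is a rough JavaScript sketch of the same "one big regex" approach, built from a list so that adding a rule does not mean editing the pattern by hand. The `rules` list, the type names, and `classifyCombined` are assumptions for illustration. It relies on named capture groups, which reasonably recent JavaScript engines support.

```javascript
// Hypothetical rule list; patterns are written without ^ and $.
const rules = [
  { type: "type1", pattern: "[0-9]" },
  { type: "type2", pattern: "[a-x]{300}" },
  { type: "type3", pattern: "[x-z]{1,5}" },
];

// Built once, up front: ^(?:(?<type1>[0-9])|(?<type2>[a-x]{300})|(?<type3>[x-z]{1,5}))$
const combined = new RegExp(
  "^(?:" + rules.map((r) => `(?<${r.type}>${r.pattern})`).join("|") + ")$"
);

function classifyCombined(s) {
  const m = combined.exec(s);
  if (!m) return null; // no alternative matched the whole string
  // Only the alternative that matched has a defined named group.
  for (const r of rules) {
    if (m.groups[r.type] !== undefined) return r.type;
  }
  return null;
}
```

Alternatives are tried in order, so when a string could match several rules, the earliest one in the list wins. A backtracking engine may still try the alternatives one after another internally, so it is worth measuring whether the combined pattern is actually faster for your rules.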

Since the regexes are going to change, I don't think you can get a generic answer: both your regexes and the way you handle them will need to evolve. For now, if you're looking to optimize your script, test for known substrings with something like indexOf before evaluating a regex, to lighten the regex load.

For instance, if you have 4 strings:

  • asdfsdfkjslkdujflkj2lkjsdlkf2lkja
  • 100010010100111010100101001001011
  • 101032021309420940389579873987113
  • asdfkajhslkdjhflkjshdlfkjhalksjdf

Each belongs to one of the "types" you've described, so you could do:

// type 1 contains only 0 or 1
// type 2 must have a "2"
// type 3 contains only letters

var arr = [
    "asdfsdfkjslkdujflkj2lkjsdlkf2lkja",
    "100010010100111010100101001001011",
    "101032021309420940389579873987113",
    "asdfkajhslkdjhflkjshdlfkjhalksjdf"
    ];

for (var i = 0; i < arr.length; i++)
{
    var s = arr[i];
    // indexOf returns -1 when not found; 0 is a valid position
    if (s.indexOf('2') !== -1)
    {
        // type 2
    }
    else if (s.indexOf('0') !== -1 || s.indexOf('1') !== -1)
    {
        // only now pay for the regex
        if (/^[01]+$/.test(s))
        {
            // type 1
        }
        else
        {
            // ignore
        }
    }
    else if (/^[a-z]+$/i.test(s))
    {
        // type 3
    }
}

See it in action here: http://jsfiddle.net/remus/44MdX/
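To make the result of the pre-check approach visible, the same logic can be wrapped in a function that returns a label. This is only a sketch: the function name `classifyByPrecheck` and the label strings are made up for the example.

```javascript
function classifyByPrecheck(s) {
  // Cheap substring test first, no regex involved.
  if (s.includes("2")) return "type 2";
  // Run the regex only for strings that could plausibly be binary.
  if (s.includes("0") || s.includes("1")) {
    return /^[01]+$/.test(s) ? "type 1" : null;
  }
  return /^[a-z]+$/i.test(s) ? "type 3" : null;
}

const samples = [
  "asdfsdfkjslkdujflkj2lkjsdlkf2lkja",
  "100010010100111010100101001001011",
  "101032021309420940389579873987113",
  "asdfkajhslkdjhflkjshdlfkjhalksjdf",
];

const results = samples.map(classifyByPrecheck);
console.log(results); // [ 'type 2', 'type 1', 'type 2', 'type 3' ]
```

The first and third samples both contain a "2", so they end up in the same type.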
