
RegExp infinite loop only in perl, why?

I have a regular expression to test whether a CSV cell contains a correct file path:

EDIT: The CSV lists file paths that do not yet exist when the script runs (so I cannot use -e), and a file path can include * or %variable% or {$variable}.

my $FILENAME_REGEXP = '^(|"|""")(?:[a-zA-Z]:[\\\/])?[\\\/]{0,2}(?:(?:[\w\s\.\*-]+|\{\$\w+}|%\w+%)[\\\/]{0,2})*\1$';

Since CSV cells sometimes contain wrappers of double quotes, and sometimes the filename itself needs to be wrapped in double quotes, I made this grouping: (|"|""") ... \1

Then using this function:

sub ValidateUNCPath{
    my $input = shift;
    if ($input !~ /$FILENAME_REGEXP/){
        return;
    } 
    else{
        return "This is a Valid File Path.";
    }

}

I'm trying to test if this phrase is matching my regexp (It should not match):

"""c:\my\dir\lord"

but my dear Perl gets into an infinite loop on:

ValidateUNCPath('"""c:\my\dir\lord"');

EDIT actually it loops on this:

ValidateUNCPath('"""\aaaaaaaaa\bbbbbbb\ccccccc\Netwxn00.map"');

I made sure at http://regexpal.com that my regexp correctly catches those non-symmetric """ ... " wrapping double quotes, but Perl has a mind of its own :(

I even tried the /g and /o flags in

/$FILENAME_REGEXP/go

but it still hangs. What am I missing ?

First off, nothing you have posted can cause an infinite loop, so if you're getting one, it's not from this part of the code.

When I try out your subroutine, it returns true for all sorts of strings that are far from looking like paths, for example:

.....
This is a Valid File Path.
.*.*
This is a Valid File Path.
-
This is a Valid File Path.

This is because your regex is rather loose.

^(|"|""")                  # can match the empty string
(?:[a-zA-Z]:[\\\/])?       # same, matches 0-1 times
[\\\/]{0,2}                # same, matches 0-2 times
(?:(?:[\w\s\.\*-]+|\{\$\w+}|%\w+%)[\\\/]?)+\1$  # only this is not optional

Since only the last part actually has to match anything, you are allowing all kinds of strings, mainly through the first character class: [\w\s\.\*-]

In my personal opinion, when you start relying on regexes that look like yours, you're doing something wrong. Unless you're skilled at regexes, and hope no one who isn't will ever be forced to fix it.

Why don't you just remove the quotes? Also, if this path exists in your system, there is a much easier way to check if it is valid: -e $path

Update

Edit: From trial and error, the sub-expression [\w\s.*-]+ inside the grouping below is what causes the backtracking problem

    (?:
        (?:
             [\w\s.*-]+
          |  \{\$\w+\}
          |  %\w+%
        )
        [\\\/]?
    )+

Fix #1, Unrolled method

'
 ^
    (                          # Nothing
      |"                       # Or, "
      |"""                     # Or, """
    )
                      # Here to end, there is no provision for quotes (")
    (?:               # If there are no balanced quotes, this will fail !!
        [a-zA-Z]
        :
        [\\\/]
    )?
    [\\\/]{0,2}

    (?:
        [\w\s.*-]
      |  \{\$\w+\}
      |  %\w+%
    )+
    (?:
        [\\\/]
        (?:
            [\w\s.*-]
          |  \{\$\w+\}
          |  %\w+%
        )+
    )*
    [\\\/]?
    \1
 $
'

Fix #2, Independent Sub-Expression

'
 ^
    (                          # Nothing
      |"                       # Or, "
      |"""                     # Or, """
    )
                      # Here to end, there is no provision for quotes (")
    (?:               # If there are no balanced quotes, this will fail !!
        [a-zA-Z]
        :
        [\\\/]
    )?
    [\\\/]{0,2}

    (?>
       (?:
           (?:
                [\w\s.*-]+
             |  \{\$\w+\}
             |  %\w+%
           )
           [\\\/]?
       )+
    )
    \1
 $
'
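As a rough sketch of how Fix #2 could be wired into the question's ValidateUNCPath: the layout and # comments in these fixes only work when the pattern is compiled with the x modifier, so this version writes it as a qr/.../x literal rather than a quoted string. The variable name $FILENAME_RE and the sample inputs other than the two from the question are assumptions made for illustration.

```perl
use strict;
use warnings;

# Fix #2 compiled with the x modifier so the layout and comments are ignored.
my $FILENAME_RE = qr/
    ^
    ( | " | """ )                 # wrapper: nothing, one quote, or three quotes
    (?: [a-zA-Z] : [\\\/] )?      # optional drive letter
    [\\\/]{0,2}                   # optional leading separators
    (?>                           # atomic group: never re-split once it has matched
        (?:
            (?: [\w\s.*-]+ | \{\$\w+\} | %\w+% )
            [\\\/]?
        )+
    )
    \1                            # closing wrapper must equal the opening one
    $
/x;

sub ValidateUNCPath {
    my $input = shift;
    return unless $input =~ $FILENAME_RE;
    return "This is a Valid File Path.";
}

for my $cell (
    '"""c:\my\dir\lord"',                              # unbalanced, from the question
    '"""\aaaaaaaaa\bbbbbbb\ccccccc\Netwxn00.map"',     # the input that hung
    '"c:\my\dir\lord"',                                # balanced (assumed sample)
    'c:\my\dir\lord',                                  # unquoted (assumed sample)
) {
    my $result = ValidateUNCPath($cell);
    printf "%-50s %s\n", $cell, defined $result ? $result : 'rejected';
}
```

With the atomic group in place, both unbalanced inputs should be rejected without the long pause, because the engine gets a single attempt at \1 for each choice of opening wrapper instead of retrying every way of splitting the path.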

Fix #3, remove the + quantifier (or add +?)

'
 ^
    (                          # Nothing
      |"                       # Or, "
      |"""                     # Or, """
    )
                      # Here to end, there is no provision for quotes (")
    (?:               # If there are no balanced quotes, this will fail !!
        [a-zA-Z]
        :
        [\\\/]
    )?
    [\\\/]{0,2}

    (?:
        (?:
             [\w\s.*-] 
          |  \{\$\w+\}
          |  %\w+%
        )
        [\\\/]?
    )+
    \1
 $
'

If the regex engine was naïve,

('y' x 20) =~ /^.*.*.*.*.*x/

would take a very long time to fail since it has to try

20 * 20 * 20 * 20 * 20 = 3,200,000 possible matches.

Your pattern has a similar structure: it has many components that can each match a wide range of substrings of your input.

Now, Perl's regex engine is highly optimised, and far far from naïve. In the above pattern, it will start by looking for x , and exit very very fast. Unfortunately, it doesn't or can't similarly optimise your pattern.
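To make that concrete, here is a small sketch that counts how often the engine arrives at the closing \1 before giving up. It uses a deliberately simplified stand-in for the question's pattern (just \w+ with an optional backslash, repeated), not the real one, and the counter $tries and helper count_tries are made up for this illustration. The exact counts are an implementation detail and may vary between Perl versions, so none are quoted here.

```perl
use strict;
use warnings;

our $tries = 0;   # bumped by the embedded code block each time the engine reaches \1

# Simplified stand-in for the question's pattern: a nested quantifier
# followed by a backreference that cannot match the test input.
my $plain  = qr/^(|"|""")(?:\w+\\?)+(?{ $tries++ })\1$/;

# Same pattern with the repeated part wrapped in an atomic group.
my $atomic = qr/^(|"|""")(?>(?:\w+\\?)+)(?{ $tries++ })\1$/;

# Hypothetical helper: run one match and report how many times \1 was attempted.
sub count_tries {
    my ($re, $input) = @_;
    $tries = 0;
    my $matched = $input =~ $re;   # result is not needed, only the side effect
    return $tries;
}

# Three opening quotes but only one closing quote, so neither pattern can match.
my $input = '"""' . ('a' x 16) . '"';

printf "plain:  %d attempts at the closing quotes\n", count_tries($plain,  $input);
printf "atomic: %d attempts at the closing quotes\n", count_tries($atomic, $input);
```

The plain version has to try the different ways of splitting the run of a's between iterations of the outer group before it can report failure, and that number grows quickly as the input gets longer. The atomic version cannot re-split the run, so it reaches \1 far fewer times.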

Your pattern is a complete mess. I'm not even going to try to guess what it's supposed to match. You will find that this problem will solve itself once you switch to a correct pattern.

Thanks to sln this is my fixed regexp:

my $FILENAME_REGEXP = '^(|"|""")(?:[a-zA-Z]:[\\\/])?[\\\/]{0,2}(?:(?:[\w\s.-]++|\{\$\w+\}|%\w+%)[\\\/]{0,2})*\*?[\w.-]*\1$';

(I also disallowed the * character in directory names, and only allow a single * in the last component, the filename.)
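A quick way to sanity-check the fixed pattern, sketched under a few assumptions: the helper name is_valid_path and the sample cells (apart from the one from the question) are invented for illustration, and the possessive ++ quantifier needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# The fixed pattern from above, kept as a single-quoted string.
my $FILENAME_REGEXP = '^(|"|""")(?:[a-zA-Z]:[\\\/])?[\\\/]{0,2}(?:(?:[\w\s.-]++|\{\$\w+\}|%\w+%)[\\\/]{0,2})*\*?[\w.-]*\1$';
my $re = qr/$FILENAME_REGEXP/;   # compile once, reuse for every cell

# Hypothetical helper: 1 if the cell looks like a valid path, 0 otherwise.
sub is_valid_path { return $_[0] =~ $re ? 1 : 0 }

my @cases = (
    'c:\temp\*.txt',                                   # single * in the filename
    'c:\te*mp\file.txt',                               # * inside a directory name
    '"%TEMP%\{$name}\out.log"',                        # variables, balanced quotes
    '"""\aaaaaaaaa\bbbbbbb\ccccccc\Netwxn00.map"',     # unbalanced, from the question
);

printf "%-50s %s\n", $_, is_valid_path($_) ? 'valid' : 'invalid' for @cases;
```

The first and third cells should come out as valid and the other two as invalid, and the last one should be rejected without the earlier delay.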
