简体   繁体   中英

Can a Javascript RegExp match at an exact string offset without searching?

I'm parsing a moderately complex grammar using Javascript and I'd like to use regular expressions to match tokens such as numbers.

Given a string containing the grammar, a regular expression representing a number (say) and an offset within the string, I'd like to find out if the regular expression matches the string at that offset exactly.

I can set lastIndex, call RegExp.exec and check the index property of the resulting match to see if the match happened at the expected offset, but this is very inefficient because exec will search the entire string if it doesn't find a match at the starting offset.

The Javascript spec says "A Pattern evaluates ("compiles") to an internal procedure value. RegExp.prototype.exec can then apply this procedure to a String and an offset within the String to determine whether the pattern would match starting at exactly that offset within the String."

This is exactly what I want, but there seems to be no way of accessing this internal function. Does anyone know if there is?

PS I'm currently avoiding this problem by splitting the input string into an array of tokens, but I'd prefer not to.

I've thoroughly tested possibly efficient methods, see JSPerf: ~20000 chars, ~1000000 chars. I've created a function to generate a random string consisting of alphanumeric characters. After running this function once, a RegExp pattern is created, to match a string with length 10 at a given offset.

Tested cases (when the condition in if(..) is true, the pattern is found at offset index ):

var string = "...about 20000 characters: A-Z a-z 0-9...";
var regexp = /abcdef1324/g;
var regexpSubstr = /^abcdefg1234/;
var index = 10000;

/*1*/ if ( regexpSubstr.exec(string.substr(index,10)) ) ;
/*2*/ regexp.lastIndex = index;
      var found =  regexp.exec(string);
      if (found && found.length + index == regexp.lastIndex ) ;

/*3*/ if ( regexpSubstr.test(string.substr(index,10)) ) ;
/*4*/ // Only when the RegExp matches a fixed number of characters
      regexp.lastIndex = index;
      if ( regexp.test(string) && regexp.lastIndex == index + 10 ) ;

Case 1 and Case 3 are equivalent, because they're checking whether a substring matches the /^abcdef1234/ pattern (does the selected substring start with "abc..etc"?).

Case 2 and Case 4 use the .lastIndex method:
1. Set the .lastIndex property of the RegExp to the desired offset
2. Check whether the pattern is found.
3. Check whether a found pattern is positioned at offset index .
These methods require a Regular expression to have enabled the global flag.

At very large strings, method 4 ( lastIndex + test ) is proved to be most efficient whn a match at the offset occurs. Method 4, however, requires the matched pattern to have a pre-defined, fixed size.

Method 3 ( substr + test ) is slightly slower than 4, when a match occurs at the given position. When no match is found at a large string, however, method 3 is significantly faster than method 4. Method 1 and method 3 seems to be equally fast when no match is found.

RegExp methods
The .exec seems to not be more efficient than .test . The match method is not suitable for this case, since it attempts to find all matches, regardless of the .lastIndex property. Another possible RegExp function is the .search function, which is significantly slower for large strings, compared to previously shown methods.

If you can set the search start position, you may be able to take advantage
of the end anchor in an assertion to terminate on a real non-match.

It requires that you accept the fact that the only time it wont match (on the surface)
is when that start position falls at the end of the string.

Then post process to check the length of the capture buffer. It depends on the regex,
but the likely fail is if the capture buffer is zero length.

Example:
(\\d{6} | (?!$)) or ( (?:any subexpression) | (?!$) )

this will always match (unless the starting position is at the end of string, or
the string is empty).

Outcomes
- Does not match: The string is empty, or starting position is at the end of string.
- Matches: %99.999 of the time. If length of capture buffer is 0 (ie: ''), either the
left side of the alternation failed, or it passed by capturing nothing, regex dependent.

The start position is taken care of, however, static length matches in regex's are sometimes dificult (not impossible) to control with quantifiers. Its probably more
reasonable to use regex on substrings in the case of open ended quantifiers.

Perhaps new RegExp("^.{" + offset + "}whatever") , ie any character exactly offset times, followed by whatever? Needs benchmarking, but one would hope RegExp would optimize the "any character exactly offset times" part?

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM