Convert Perl regular expression to equivalent ECMAScript regular expression

Now I'm using VC++ 2010, but the syntax_option_type of VC++ 2010 only contains the following options:

static const flag_type icase = regex_constants::icase;
static const flag_type nosubs = regex_constants::nosubs;
static const flag_type optimize = regex_constants::optimize;
static const flag_type collate = regex_constants::collate;
static const flag_type ECMAScript = regex_constants::ECMAScript;
static const flag_type basic = regex_constants::basic;
static const flag_type extended = regex_constants::extended;
static const flag_type awk = regex_constants::awk;
static const flag_type grep = regex_constants::grep;
static const flag_type egrep = regex_constants::egrep;

It doesn't contain perl_syntax_group (the Boost library has that option). However, I don't want to use the Boost library.

There are many regular expressions written in Perl syntax, so I want to convert the existing Perl regular expressions to ECMAScript (or any other syntax that VC++ 2010 supports). After conversion I can use the equivalent regular expressions directly in VC++ 2010 without a third-party library.

One example:

const boost::tregex e(__T("\\A(\\d{3,4})[- ]?(\\d{4})[- ]?(\\d{4})[- ]?(\\d{4})\\z"));
const CString human_format = __T("$1-$2-$3-$4");
CString human_readable_card_number(const CString& s)
{
   return boost::regex_replace(s, e, human_format);
}
CString credit_card_number = "1234567887654321";
credit_card_number = human_readable_card_number(credit_card_number);
assert(credit_card_number == "1234-5678-8765-4321");

In the above example, what I want to do is convert e and human_format to ECMAScript-style expressions.

Is it possible to find a general way to convert all Perl regular expressions to ECMAScript style? Are there some tools to do this?

Any help will be appreciated!

For the particular regex you want to convert, the equivalent in ECMA regex is:

/^(\d{3,4})[- ]?(\d{4})[- ]?(\d{4})[- ]?(\d{4})$/

In this case, \A (in Perl regex) has the same meaning as ^ (in ECMA regex): both match the beginning of the string. \z (in Perl regex) has the same meaning as $ (in ECMA regex): both match the very end of the string. (Perl's uppercase \Z also matches before a final newline, so it is not an exact equivalent of $.) Note that ^ and $ in ECMA regex change to match the beginning and the end of each line if you enable multiline mode.
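
As a rough sketch of how the converted pattern might be used with the <regex> header that ships with VC++ 2010, the function from the question could be rewritten along these lines. It assumes std::string in place of CString (a CString/TCHAR build would presumably need std::basic_regex<TCHAR> instead), and check_card_examples is a hypothetical helper added only for illustration.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Same job as human_readable_card_number in the question, but using
// std::regex (ECMAScript is its default grammar) instead of Boost.
// Assumption: std::string replaces CString to keep the sketch portable.
std::string human_readable_card_number(const std::string& s)
{
    // Perl \A became ^ and Perl \z became $; the rest is unchanged.
    static const std::regex e(
        "^(\\d{3,4})[- ]?(\\d{4})[- ]?(\\d{4})[- ]?(\\d{4})$");
    // $1..$4 in the format string refer to the four capture groups.
    return std::regex_replace(s, e, std::string("$1-$2-$3-$4"));
}

// Hypothetical self-check mirroring the assert in the question.
void check_card_examples()
{
    assert(human_readable_card_number("1234567887654321")
           == "1234-5678-8765-4321");
    // Input that does not match the pattern is returned unchanged.
    assert(human_readable_card_number("not a card") == "not a card");
}
```

The format string needed no conversion here, since $n means the same thing in both dialects.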

ECMA regex is largely a subset of Perl regex, so a regex that relies on Perl-only features is likely not convertible to ECMA regex. Even where the syntax is the same, its meaning may differ slightly between the two dialects, so it is always wise to check the documentation and compare the usage.

I'm only going to say what is similar between ECMA regex and Perl regex. Where something is not similar but is convertible, I will mention it to the best of my ability.

ECMA regex lacks features for working with Unicode, which compels you to look up the code points and specify them in character classes.

Going through the documentation for Perl regular expressions, section by section:

  • Modifiers:
    • Only i, g, m are in the ECMA standard, and they behave the same as in Perl.
    • The s (dot-all) modifier can be simulated in ECMA regex by using two complementary character classes, e.g. [\S\s] or [\D\d].
    • There is no support of any kind for the x and p flags.
    • I don't know if there is any way to simulate the rest (prefix and suffix modifiers).
  • Meta characters:
    • I have some doubt about using \ with a non-meta character that doesn't resolve to any special meaning, but it should be fine as long as you don't escape where you don't need to. The . in ECMA excludes a few more characters (\r, \u2028 and \u2029 in addition to \n). The rest behave the same in ECMA regex (including the effect of the m flag on ^ and $).
  • Quantifier:
    • Greedy and lazy behavior should be the same. There are no possessive quantifiers in ECMA regex.
  • Escape sequences:
    • There's no \a and \e in ECMA regex. \t, \n, \r, \f are the same.
    • Check the documentation if the regex has \cX - there are differences.
    • \xhh is common to ECMA regex and Perl regex. Specifying exactly 2 hexadecimal digits is the safest; otherwise, you will have to look up the documentation to see how each language deals with fewer than 2 hexadecimal digits.
    • \uhhhh is an ECMA regex exclusive feature for specifying a Unicode character. Perl has other exclusive ways to specify a character, such as \x{}, \N{}, \o{}, \000.
    • \l, \u, \L, \U are exclusive to Perl regex.
    • \Q and \E can be simulated by escaping the quoted section by hand.
    • An octal escape with fewer than 3 octal digits in Perl regex may be confusing. Check the context carefully, read the documentation, and/or test the regex to make sure you understand what it is doing in context, since it might be either an escape sequence or a back reference.
  • Character classes and other special escapes:
    • \w, \W, \s, \S, \d, \D are equivalent in ECMA regex and Perl regex, assuming US-ASCII. If Unicode is involved, things will be a bloody mess.
    • No POSIX character class in ECMA regex. Use the above \w, \s, \d or spell out the characters yourself in a character class. (The modified ECMAScript grammar of C++ std::regex does accept [[:alpha:]]-style classes inside brackets, so these may not need converting there.)
    • Back references are mostly the same - but I don't know whether either Perl or ECMA regex allows a back reference number to go beyond 9.
    • Named references can be simulated with numbered back references.
    • The rest (except [] and the escape sequences already mentioned) are unsupported in ECMA regex.
  • Assertion:
    • \b and \B are equivalent in both languages, with regard to how they are defined in terms of \w.
  • Capture groups: Grouping () and back references are the same. $n, which is used in the replacement string to refer back to matched text, is the same. The rest of the section covers Perl exclusive features.
  • Quoting meta-characters: (Content already mentioned in previous sections).
  • Extended Pattern:
    • ECMA regex doesn't support changing flags inside the regex. Depending on what the flags are, you may be able to rewrite the regex (the s flag is one that can always be converted to an equivalent expression in ECMA regex).
    • Only (?:pattern) (non-capturing group), (?=pattern) (positive look-ahead), (?!pattern) (negative look-ahead) are common between Perl and ECMA.
    • There is no comment syntax in ECMA regex, so (?#text) can simply be dropped during conversion.
    • Look-behinds are not supported in ECMA regex. Fixed-width look-behind is supported in Perl. In some cases, a Perl regex with a positive look-behind can be converted to ECMA regex by turning the look-behind into a capturing group.
    • As mentioned before, named capture groups can be converted to normal capture groups and referred to with numbered back references.
    • The rest are Perl exclusive features.
  • Special Backtracking Control Verbs: This is Perl exclusive, and I have no idea what these do (never touched them before), let alone conversion. It's most likely the case that they are not convertible anyway.
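
A few of the rewrites listed above can be sketched with std::regex as follows. This is only an illustration under assumptions: the function names (spans_lines, quote_meta, is_quoted, check_rewrites) are hypothetical and not part of any library, and the Perl originals shown in the comments are made-up examples.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Perl: /BEGIN.*END/s   ->   ECMAScript: BEGIN[\s\S]*END
// [\s\S] matches any character, including line terminators.
bool spans_lines(const std::string& text)
{
    static const std::regex e("BEGIN[\\s\\S]*END");
    return std::regex_search(text, e);
}

// Perl: \Q...\E   ->   escape every metacharacter by hand.
// quote_meta is a hypothetical helper, not a standard function.
std::string quote_meta(const std::string& literal)
{
    static const std::string meta = "\\^$.|?*+()[]{}";
    std::string out;
    for (std::string::size_type i = 0; i < literal.size(); ++i) {
        if (meta.find(literal[i]) != std::string::npos)
            out += '\\';
        out += literal[i];
    }
    return out;
}

// Perl: (?<q>['"]).*?\k<q>   ->   ECMAScript: (['"]).*?\1
// The named group becomes group 1 and is referenced by number.
bool is_quoted(const std::string& s)
{
    static const std::regex e("(['\"]).*?\\1");
    return std::regex_match(s, e);
}

// Hypothetical self-check for the three rewrites.
void check_rewrites()
{
    assert(spans_lines("BEGIN\nfoo\nEND"));
    assert(quote_meta("a.b*c") == "a\\.b\\*c");
    assert(is_quoted("'hi'") && !is_quoted("'hi\""));
}
```

Each function keeps the Perl original in a comment, which makes it easier to compare the two patterns when the converted one misbehaves.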

Conclusion:

If the regex uses the full power of Perl regex, or features at the level that the Boost library supports (e.g. recursive regex), it is not possible to convert it to ECMA regex. Fortunately, ECMA regex covers the most commonly used features, so it's likely that your regexes are convertible.

Reference:

ECMA RegExp Reference on MDN
