Convert Perl regular expression to equivalent ECMAScript regular expression

I'm using VC++ 2010, but its syntax_option_type only contains the following options:

static const flag_type icase = regex_constants::icase;
static const flag_type nosubs = regex_constants::nosubs;
static const flag_type optimize = regex_constants::optimize;
static const flag_type collate = regex_constants::collate;
static const flag_type ECMAScript = regex_constants::ECMAScript;
static const flag_type basic = regex_constants::basic;
static const flag_type extended = regex_constants::extended;
static const flag_type awk = regex_constants::awk;
static const flag_type grep = regex_constants::grep;
static const flag_type egrep = regex_constants::egrep;

It doesn't contain perl_syntax_group (the Boost library has that option). However, I don't want to use the Boost library.

There are many existing regular expressions written in Perl syntax, so I want to convert them to ECMAScript (or any other syntax that VC++ 2010 supports). After conversion, I can use the equivalent regular expressions directly in VC++ 2010 without a third-party library.

One example:

const boost::tregex e(__T("\\A(\\d{3,4})[- ]?(\\d{4})[- ]?(\\d{4})[- ]?(\\d{4})\\z"));
const CString human_format = __T("$1-$2-$3-$4");
CString human_readable_card_number(const CString& s)
{
   return boost::regex_replace(s, e, human_format);
}
CString credit_card_number = "1234567887654321";
credit_card_number = human_readable_card_number(credit_card_number);
assert(credit_card_number == "1234-5678-8765-4321");

In the above example, what I want to do is convert e and the format string (human_format) to ECMAScript-style expressions.

Is there a general way to convert all Perl regular expressions to ECMAScript style? Are there any tools that do this?

Any help will be appreciated!

For the particular regex you want to convert, the equivalent in ECMA regex is:

/^(\d{3,4})[- ]?(\d{4})[- ]?(\d{4})[- ]?(\d{4})$/

In this case, \A (in Perl regex) has the same meaning as ^ (in ECMA regex), matching the beginning of the string, and \z (in Perl regex) has the same meaning as $ (in ECMA regex), matching the end of the string. (Perl's uppercase \Z is slightly different: it also matches just before a final newline.) Note that ^ and $ in ECMA regex change to match the beginning and the end of each line if you enable multiline mode.
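As a rough sketch of how the converted pattern might be used without Boost, the snippet below rewrites the question's function on top of std::regex from the <regex> header. It assumes std::string in place of CString and boost::tregex, which is a simplification made here for portability and not something the question specifies. Whether ^ and $ also match at line breaks can differ between standard library implementations, so it is worth testing with multi-line input if that matters for your data.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Assumption: std::string replaces CString to keep the sketch self-contained.
std::string human_readable_card_number(const std::string& s)
{
    // ^ and $ stand in for Perl's \A and \z; ECMAScript is the default grammar.
    static const std::regex e(
        "^(\\d{3,4})[- ]?(\\d{4})[- ]?(\\d{4})[- ]?(\\d{4})$");
    // $1..$4 in the format string refer to the capture groups, as in Perl.
    return std::regex_replace(s, e, std::string("$1-$2-$3-$4"));
}
```

Calling human_readable_card_number("1234567887654321") yields "1234-5678-8765-4321"; input that does not match the pattern is returned unchanged.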

ECMA regex is largely a subset of Perl regex, so if the regex uses features exclusive to Perl regex, it is likely not convertible to ECMA regex. Even where the syntax is identical, it may mean slightly different things in the two dialects, so it is always wise to check the documentation and compare the usage.

I'm only going to cover what is similar between ECMA regex and Perl regex. Where something is not similar but convertible, I will mention it to the best of my ability.

ECMA regex lacks features for working with Unicode, which compels you to look up the code points and specify them in character classes.

Going through the documentation for Perl regular expressions:

  • Modifiers:
    • Only i, g, m are in the ECMA standard, and they behave the same as in Perl.
    • The s (dot-all) modifier can be simulated in ECMA regex by using 2 complementing character classes, e.g. [\S\s] or [\D\d].
    • There is no support at all for the x and p flags.
    • I don't know if there is any way to simulate the rest (prefix and suffix modifiers).
  • Meta characters:
    • I have some doubt about using \ with a non-meta character that doesn't resolve to any special meaning, but it should be fine if you don't escape where you don't need to.
    • . in ECMA regex excludes a few more characters than in Perl (all line terminators, not just \n).
    • The rest behave the same in ECMA regex (even the effect of the m flag on ^ and $).
  • Quantifiers:
    • Greedy and lazy behavior should be the same. There is no possessive behavior in ECMA regex.
  • Escape sequences:
    • There's no \a and \e in ECMA regex. \t, \n, \r, \f are the same.
    • Check the documentation if the regex has \cX; there are differences.
    • \xhh is common to ECMA regex and Perl regex (specifying 2 hexadecimal digits is the safest; otherwise, you will have to look up the documentation to see how the language deals with the case where there are fewer than 2 hexadecimal digits).
    • \uhhhh is an ECMA regex exclusive feature for specifying a Unicode character. Perl has other exclusive ways to specify a character, such as \x{}, \N{}, \o{}, \000.
    • \l, \u, \L, \U are exclusive to Perl regex.
    • \Q and \E can be simulated by escaping the quoted section by hand.
    • An octal escape with fewer than 3 octal digits in Perl regex may be confusing. Check the context carefully, read the documentation, and/or test the regex to make sure you understand what it is doing in context, since it might be either an escape sequence or a back reference.
  • Character classes and other special escapes:
    • \w, \W, \s, \S, \d, \D are equivalent in ECMA regex and Perl regex, assuming US-ASCII. If Unicode is involved, things will be a bloody mess.
    • There are no POSIX character classes in ECMA regex. Use the above \w, \s, \d or spell out the characters yourself in a character class.
    • Back reference is mostly the same, but I don't know whether back references beyond 9 are allowed in both Perl and ECMA regex.
    • Named reference can be simulated with back reference.
    • The rest (except [] and the escape sequences already mentioned) are unsupported in ECMA regex.
  • Assertions:
    • \b and \B are equivalent in both languages, with regard to how they are defined based on \w.
  • Capture groups: Grouping () and back reference are the same. $n, which is used in the replacement string to refer back to matched text, is the same. The rest of that section consists of Perl exclusive features.
  • Quoting meta-characters: (content already mentioned in previous sections).
  • Extended patterns:
    • ECMA regex doesn't support modifying flags inside the regex. Depending on what the flags are, you may be able to rewrite the regex (the s flag is one that can always be converted to an equivalent expression in ECMA regex).
    • Only (?:pattern) (non-capturing group), (?=pattern) (positive look-ahead) and (?!pattern) (negative look-ahead) are common between Perl and ECMA.
    • There are no comments in ECMA regex, so (?#text) can simply be dropped.
    • Look-behinds are not supported in ECMA regex. Fixed-width look-behind is supported in Perl. In some cases, a Perl regex with a positive look-behind can be converted to ECMA regex by turning the look-behind into a capturing group.
    • As mentioned before, a named pattern can be converted to a normal capture group and referred to with a numbered back reference.
    • The rest are Perl exclusive features.
  • Special backtracking control verbs: These are Perl exclusive, and I have no idea what they do (never touched them before), let alone how to convert them. Most likely they are not convertible anyway.
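
To make a few of the rewrites from the list above concrete, here is a minimal sketch using std::regex with the ECMAScript grammar. The helper names (quote_meta, a_any_b, is_hello, digits_after_dollar) and the sample patterns are hypothetical and chosen only for illustration. Note that the look-behind rewrite changes what the overall match covers (the prefix becomes part of the match), so it only works when you can read the capture group instead of the whole match.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical stand-in for Perl's \Q...\E: backslash-escape every character
// that has a special meaning in an ECMAScript pattern.
std::string quote_meta(const std::string& literal)
{
    const std::string special = "\\^$.|?*+()[]{}";
    std::string out;
    for (std::string::size_type i = 0; i < literal.size(); ++i) {
        if (special.find(literal[i]) != std::string::npos)
            out += '\\';
        out += literal[i];
    }
    return out;
}

// Perl /a.b/s  ->  a[\s\S]b : the class [\s\S] matches any character,
// including line terminators.
bool a_any_b(const std::string& s)
{
    const std::regex e("a[\\s\\S]b");
    return std::regex_match(s, e);
}

// Perl (?i)hello  ->  move the inline flag to the constructor.
bool is_hello(const std::string& s)
{
    const std::regex e("hello", std::regex::ECMAScript | std::regex::icase);
    return std::regex_match(s, e);
}

// Perl (?<=\$)\d+  ->  \$(\d+) : match the prefix for real and read group 1.
std::string digits_after_dollar(const std::string& s)
{
    const std::regex e("\\$(\\d+)");
    std::smatch m;
    if (std::regex_search(s, m, e))
        return m[1].str();
    return std::string();
}
```

For instance, quote_meta("a.b") produces a pattern that matches the literal text "a.b" but not "axb", and digits_after_dollar("cost: $42 today") returns "42".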

Conclusion:

If the regex utilizes the full power of Perl regex, or features at the level the Boost library supports (e.g. recursive regex), it is not possible to convert it to ECMA regex. Fortunately, ECMA regex covers the most commonly used features, so it's likely that your regexes are convertible.

Reference:

ECMA RegExp Reference on MDN
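
One cheap first check while converting a batch of patterns, sketched below, is to try compiling each one with std::regex and catch std::regex_error. The helper name compiles_as_ecmascript is hypothetical. This only flags syntax the ECMAScript grammar rejects, such as look-behind; a pattern that compiles may still behave differently from the Perl original, so it does not replace testing against real input.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: true if std::regex accepts the pattern under the
// ECMAScript grammar, false if construction throws std::regex_error.
bool compiles_as_ecmascript(const std::string& pattern)
{
    try {
        std::regex e(pattern, std::regex::ECMAScript);
        (void)e; // only the construction matters here
        return true;
    } catch (const std::regex_error&) {
        return false;
    }
}
```

With this, the converted card-number pattern is accepted, while a Perl pattern that still contains a look-behind such as (?<=\$)\d+ is rejected.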
