RegEx different substitutions based on groups?

Original post: 2017-09-04 05:04:28 | Tags: c#, .net, regex, vb.net

So I'm relatively n00bish at regular expressions, and am doing a little practicing.

I'm playing with a dog-simple "deobfuscator" that just looks for [dot] or (dot) or [at] or (at). Case-insensitive, and with or without any number of spaces before or after the match(es).

This is for the usual someemail [AT] domain (dot) com type of thing. Obviously, I want to turn it into someemail@domain.com.

The RegEx I've come up with does the matching fine, but now I want to replace with either a . or a @ depending on the match.

I.e., I want a match of the "dot" group replaced with a literal . and a match of the "at" group replaced with a literal @.

I know I could just write 2 different (almost identical) RegExes and run the input through both, but for the sake of education, I'm trying to see if I can do it all in one RegEx.

Here's the RegEx I came up with (probably not the smallest possible, which I'd also be interested in seeing):

 +(\[|\()(dot)(\)|\]) +| +(\[|\()(at)(\)|\]) +

NOTE: there's a space before each +, for matching spaces.

What I'm looking for is what I would use to do the replacement(s) properly.

Update: Sorry all, forgot to add which language I was working with for this. In this case, I'm using a clipboard utility that can run RegExes on its input (whatever gets copied to the clipboard), and the engine it uses is C#/VB.NET. The ultimate goal for this little project is just to be able to copy an "obfuscated" email address or URL and run the RegEx on it, so that it's set on the clipboard in its "unobfuscated" state.

That said, I do tend to use RegExes in many different languages, so converting them between languages generally isn't an issue.

2 Answers

.NET regex does not support conditional replacement patterns.
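In engines without conditional replacement patterns, the usual workaround is to keep a single regex and choose the replacement in a callback. In .NET that hook is the Regex.Replace overload that takes a MatchEvaluator. The sketch below shows the same idea in Java (used here because Java also comes up in this thread), where Matcher.replaceAll accepts a function from Java 9 onward. The names Deobfuscator, TOKEN and deobfuscate are made up for illustration, and the pattern is a simplified variant of the one in the question, not the poster's exact regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Deobfuscator {
    // Group 1 participates for "dot", group 2 for "at".
    // The optional spaces around the token are consumed by the match.
    private static final Pattern TOKEN = Pattern.compile(
            " *[\\[(](?:(dot)|(at))[\\])] *", Pattern.CASE_INSENSITIVE);

    static String deobfuscate(String input) {
        // Matcher.replaceAll(Function) requires Java 9 or later.
        // The function's result is interpreted as a replacement string,
        // so it is quoted to keep it literal.
        return TOKEN.matcher(input).replaceAll(
                m -> Matcher.quoteReplacement(m.group(1) != null ? "." : "@"));
    }

    public static void main(String[] args) {
        System.out.println(deobfuscate("someemail [AT] domain (dot) com"));
    }
}
```

A group that did not take part in the match comes back as null, which is what the callback tests. Note that, like the contracted regex further down, this pattern also accepts mismatched brackets such as [dot).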

for the sake of education, I'm trying to see if I can do it all in one RegEx?

There are other regex engines that allow conditional replacement logic in a single regex replacement operation, by means of conditional replacement patterns.

Three engines support this type of replacement: JGsoft V2, Boost, and PCRE2.

For conditionals to work in Boost, you need to pass regex_constants::format_all to regex_replace. For them to work in PCRE2, you need to pass PCRE2_SUBSTITUTE_EXTENDED to pcre2_substitute.

In PCRE2:

${1:+matched:unmatched} where 1 is a number between 1 and 99 referencing a numbered capturing group. If your regex contains named capturing groups then you can reference them in a conditional by their name: ${name:+matched:unmatched}.

If you want a literal colon in the matched part, then you need to escape it with a backslash. If you want a literal closing curly brace anywhere in the conditional, then you need to escape that with a backslash too. Plus signs have no special meaning beyond the :+ that starts the conditional, so they don't need to be escaped.

Also, see The Boost-Specific Format Sequences:

When specifying the format_all flag to regex_replace(), the escape sequences recognized are the same as those above for format_perl. In addition, conditional expressions of the following form are recognized:

?Ntrue-expression:false-expression

where N is a decimal digit representing a sub-match. If the corresponding sub-match participated in the full match, then the substitution is true-expression. Otherwise, it is false-expression. In this mode, you can use parens () for grouping. If you want a literal paren, you must escape it as \(.

In Boost replacement patterns, literal ( and ) must be escaped.

The syntax for JGsoft V2 replacement string conditionals is the same as that in the C++ Boost library.

So, your regex can be contracted to ( +)[[(](?:(dot)|(at))[])]( +) :

( +) - Group 1: one or more spaces
[[(] - a [ or (
(?:(dot)|(at)) - either (Group 2) a dot substring or (Group 3) an at substring
[])] - a ) or ]
( +) - Group 4: one or more spaces

And replace with $1(?{2}.:@)$4 :

$1 - Group 1 value,
(?{2}.:@) - if Group 2 (dot) matched, replace with . , else (Group 3, at, matched) with @
$4 - Group 4 value.

This is available in Notepad++.

If you are using Java, try the replaceAll method from the String class.

And finally you need to normalize the whitespace:
- Pure Java - String after = before.trim().replaceAll("\\s+", " ");
- Pure Java - String after = before.replaceAll("\\s{2,}", " ").trim();
- Apache commons lang3 - String after = StringUtils.normalizeSpace(before);
- ...
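
The two pure-Java variants above are not quite equivalent, which the following minimal sketch tries to show. The class and method names (WhitespaceNormalizer, collapseAll, collapseRuns) are made up for illustration, and the commons-lang variant is left out so the example runs without extra dependencies.

```java
class WhitespaceNormalizer {
    // Trim first, then collapse every whitespace run
    // (including a lone tab or newline) to a single space.
    static String collapseAll(String s) {
        return s.trim().replaceAll("\\s+", " ");
    }

    // Collapse only runs of two or more whitespace characters, then trim.
    // A single tab or newline between words is left unchanged.
    static String collapseRuns(String s) {
        return s.replaceAll("\\s{2,}", " ").trim();
    }

    public static void main(String[] args) {
        String before = "  some\temail   here ";
        System.out.println("[" + collapseAll(before) + "]");
        System.out.println("[" + collapseRuns(before) + "]");
    }
}
```

If single tabs or newlines should also become spaces, the first variant is probably the one to pick; the second only shortens runs.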