C# Regex Replace weird behavior with multiple captures and matching at the end of string?

I'm trying to write something that formats Brazilian phone numbers, but I want it to match from the end of the string rather than the beginning, so that input strings are transformed according to the following pattern:

"5135554444" -> "(51) 3555-4444"
"35554444" -> "3555-4444"
"5554444" -> "555-4444"

Since the beginning portion is what usually changes, I thought of building the match using the $ sign so it would start at the end and then capture backwards (or so I thought), replace the match with the desired format, and afterwards just get rid of the parentheses "()" in front if they were empty.

This is the C# code:

s = "5135554444";
string str = Regex.Replace(s, @"\D", ""); //Get rid of non digits, if any
str = Regex.Replace(str, @"(\d{0,2})(\d{0,4})(\d{1,4})$", "($1) $2-$3");
return Regex.Replace(str, @"^\(\) ", ""); //Get rid of empty () at the beginning

The return value was as expected for a 10-digit number, but for anything shorter it showed some strange behavior. These were my results:

"5135554444" -> "(51) 3555-4444"
"35554444" -> "(35) 5544-44"
"5554444" -> "(55) 5444-4"

It seems that it ignores the $ at the end when matching, except that if I test with fewer than 7 digits it goes like this:

"554444" -> "(55) 444-4"
"54444" -> "(54) 44-4"
"4444" -> "(44) 4-4"

Notice that it keeps the "minimum" {n} count for the third capture group, always capturing it from the end, while the first two groups capture from the beginning, as if the last group were non-greedy from the end and just took the minimum... Is this weird, or is it me?

Now, if I change the pattern so that the third capture uses {4} instead of {1,4}, these are the results:

str = Regex.Replace(str, @"(\d{0,2})(\d{0,4})(\d{4})$", "($1) $2-$3");

"5135554444" -> "(51) 3555-4444" //As expected
"35554444" -> "(35) 55-4444" //The last four are as expected, but "35" as $1?
"54444" -> "(5) -4444" //Again "4444" in $3, why nothing in $2 and "5" in $1?

I know this is probably some stupidity of mine, but if I want to capture at the end of the string, wouldn't it be more reasonable for all the preceding capture groups to be captured in reverse order?

I would think that "54444" would turn into "5-4444" in this last example... but it does not...

How would one accomplish this?

(I know there may be a better way to accomplish the very same thing using a different approach, but what I'm really curious about is why this particular behavior of the Regex seems odd. So the answer to this question should focus on explaining why the last capture is anchored at the end of the string and why the others are not, as demonstrated in this example. I'm not particularly interested in the actual phone number formatting problem, but in understanding the Regex syntax.)

Thanks...

So you want the third part to always have four digits, the second part zero to four digits, and the first part zero to two digits, but only if the second part contains four digits?

Use

^(\d{0,2}?)(\d{0,4})(\d{4})$

As a C# snippet, commented:

resultString = Regex.Replace(subjectString, 
  @"^             # anchor the search at the start of the string
    (\d{0,2}?)    # match as few digits as possible, maximum 2
    (\d{0,4})     # match up to four digits, as many as possible
    (\d{4})       # match exactly four digits
    $             # anchor the search at the end of the string", 
   "($1) $2-$3", RegexOptions.IgnorePatternWhitespace);

By adding a ? to a quantifier (??, *?, +?, {a,b}?) you make it lazy, i.e. you tell it to match as few characters as possible while still allowing an overall match to be found.
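The difference is easiest to see with two groups competing for the same digits. This is a minimal sketch, not part of the original answer; the class `LazyVsGreedy` and the helper `Split` are names made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class LazyVsGreedy
{
    // Hypothetical helper: returns groups 1 and 2 joined by '|'.
    public static string Split(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Groups[1].Value + "|" + m.Groups[2].Value : "no match";
    }

    static void Main()
    {
        // Greedy first group: takes as many digits as it is allowed to.
        Console.WriteLine(Split(@"^(\d{0,2})(\d*)$", "1234"));     // 12|34

        // Lazy first group: takes nothing, because nothing is needed.
        Console.WriteLine(Split(@"^(\d{0,2}?)(\d*)$", "1234"));    // |1234

        // Lazy first group, but the rest of the pattern can only match
        // three digits, so the lazy group is forced to take one.
        Console.WriteLine(Split(@"^(\d{0,2}?)(\d{3})$", "1234"));  // 1|234
    }
}
```

The last case is the "while still allowing an overall match" part: a lazy quantifier starts with the minimum and grows only when the rest of the pattern cannot match otherwise.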

Without the ? in the first group, what would happen when trying to match 123456?

First, the \d{0,2} matches 12.

Then, the \d{0,4} matches 3456.

Then, the \d{4} doesn't have anything left to match, so the regex engine backtracks until that's possible again. After four steps, the \d{4} can match 3456. The \d{0,4} gives up everything it had matched greedily for this.

Now an overall match has been found, so there is no need to try any more combinations. Therefore, the first and third groups will contain parts of the match, and the second group is left empty.
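The walkthrough can be checked by printing what each group ends up holding. The sketch below is illustrative only; `BacktrackDemo` and `Groups` are hypothetical names, and the second pattern is the lazy variant suggested earlier in this answer.

```csharp
using System;
using System.Text.RegularExpressions;

static class BacktrackDemo
{
    // Hypothetical helper: returns the three captures joined by '|'.
    public static string Groups(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        if (!m.Success) return "no match";
        return m.Groups[1].Value + "|" + m.Groups[2].Value + "|" + m.Groups[3].Value;
    }

    static void Main()
    {
        // Greedy first group keeps "12"; the second group backtracks to empty.
        Console.WriteLine(Groups(@"^(\d{0,2})(\d{0,4})(\d{4})$", "123456"));   // 12||3456

        // Lazy first group starts empty, so the second group keeps "12".
        Console.WriteLine(Groups(@"^(\d{0,2}?)(\d{0,4})(\d{4})$", "123456"));  // |12|3456
    }
}
```

In both cases the engine stops at the first combination that succeeds. Groups are tried left to right, which is why the first group's preference (greedy or lazy) decides where the leftover digits land.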

You have to tell it that it's OK if the first matching groups aren't there, but not the last one:

(\d{0,2}?)(\d{0,4}?)(\d{1,4})$

Matches your examples properly in my testing.
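For reference, here is a sketch of how that pattern could be dropped into the three Replace calls from the question. The wrapper class `PhoneFormatter` and method `Format` are assumptions made for this example, not code from the question or the answers.

```csharp
using System;
using System.Text.RegularExpressions;

static class PhoneFormatter
{
    // Hypothetical wrapper around the question's three Replace calls,
    // with the lazy pattern swapped in for the middle one.
    public static string Format(string s)
    {
        string digits = Regex.Replace(s, @"\D", "");       // keep digits only
        string str = Regex.Replace(digits,
            @"(\d{0,2}?)(\d{0,4}?)(\d{1,4})$", "($1) $2-$3");
        return Regex.Replace(str, @"^\(\) ", "");          // drop an empty "() " prefix
    }

    static void Main()
    {
        foreach (string s in new[] { "5135554444", "35554444", "5554444" })
            Console.WriteLine(s + " -> " + Format(s));
        // 5135554444 -> (51) 3555-4444
        // 35554444 -> 3555-4444
        // 5554444 -> 555-4444
    }
}
```

Two edge cases are worth knowing about with this sketch: an input of four digits or fewer leaves $2 empty, so "4444" comes out as "-4444", and because the pattern has no ^ anchor, any digits beyond the last ten are left untouched in front of the formatted part.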
