
Split String with regex \w \w*? \w+?

I'm learning regexp and thought I was starting to get a grip, but then...

I tried to split a string and need help understanding something as simple as this:

String input = "abcde";
System.out.println("[a-z] " + Arrays.toString(input.split("[a-z]")));
System.out.println("\\w " + Arrays.toString(input.split("\\w")));
System.out.println("\\w*? " + Arrays.toString(input.split("\\w*?")));
System.out.println("\\w+? " + Arrays.toString(input.split("\\w+?")));

The output is
[a-z] - []
\w    - []
\w*?  - [, a, b, c, d, e]
\w+?  - []

Why doesn't either of the first two lines split the String on any character? The third expression, \\w*? (the question mark prevents greediness), works as I expected, splitting the String on every character. The plus (one or more matches) returns an empty array.

I've tried the expression in Notepad++ and in a program, and it shows 5 matches, as in:

Scanner ls = new Scanner(input);
while (ls.hasNext())
    System.out.format("%s ", ls.findInLine("\\w"));

Output is: a b c d e

This really puzzles me.

If you split a string with a regex, you are essentially saying where the string should be cut. This necessarily cuts away whatever the regex matches. So if you split at \\w, every character is a split point, and the substrings between them (all empty) are returned. Java automatically removes trailing empty strings, as described in the documentation.

This also explains why the lazy match \\w*? gives you every character: it matches every position between (and before and after) the characters, and each of those matches is zero-width. What's left are the characters of the string themselves.
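The removal of trailing empty strings is easier to see with a delimiter that does not consume the whole input. The following is a minimal sketch; the class name `TrailingEmpties` and the helper `show` are made up for illustration, and the only library call is the one-argument `String.split`.

```java
public class TrailingEmpties {
    // Formats the result of a one-argument split so that empty strings show up as "".
    static String show(String input, String regex) {
        String[] parts = input.split(regex);
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                sb.append(", ");
            }
            sb.append('"').append(parts[i]).append('"');
        }
        return sb.append("]").toString();
    }

    public static void main(String[] args) {
        System.out.println(show("a,b,,", ","));  // ["a", "b"]      trailing empties dropped
        System.out.println(show(",a,b", ","));   // ["", "a", "b"]  leading empty kept
        System.out.println(show(",,,", ","));    // []              everything was empty
    }
}
```

The last call mirrors the "abcde" case: once every piece is empty, nothing survives the trailing-empty cleanup.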

Let's break it down:

  1. [a-z], \\w, \\w+?

    Your string is

     abcde

    And the matches are as follows:

      a  b  c  d  e
     └─┘└─┘└─┘└─┘└─┘

    which leaves you with the substrings between the matches, all of which are empty.

    The above three regexes behave the same in this regard, as they all will only match a single character. \\w+? will do so because it lacks any other constraints that might make the +? try matching more than just the bare minimum (it's lazy, after all).

  2. \\w*?

       a  b  c  d  e
     └┘ └┘ └┘ └┘ └┘ └┘

    In this case the matches are between the characters, leaving you with the following substrings:

     "", "a", "b", "c", "d", "e", ""

    Java throws the trailing empty one away, though.
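The match positions in the breakdown above can be checked directly with `java.util.regex.Matcher`. This is only a sketch: the class name `MatchSpans` and the helper `spans` are illustrative names, not part of any library. It reports each match as `start-end`, with the end index exclusive, so a zero-width match shows up as something like `3-3`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchSpans {
    // Collects every match as "start-end" (end exclusive), in the order find() reports them.
    static List<String> spans(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.start() + "-" + m.end());
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "abcde";
        // Five one-character matches: [0-1, 1-2, 2-3, 3-4, 4-5]
        System.out.println("\\w   " + spans("\\w", input));
        // The lazy plus behaves the same here: [0-1, 1-2, 2-3, 3-4, 4-5]
        System.out.println("\\w+? " + spans("\\w+?", input));
        // Six zero-width matches: [0-0, 1-1, 2-2, 3-3, 4-4, 5-5]
        System.out.println("\\w*? " + spans("\\w*?", input));
    }
}
```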

Let's break down each of those calls to String#split(String). It's key to notice from the Java docs that the "method works as if by invoking the two-argument split method with the given expression and a limit argument of zero. Trailing empty strings are therefore not included in the resulting array."

"abcde".split("[a-z]"); // => []

This one matches every character (a, b, c, d, e) and results in only the empty strings between them, which are omitted.

"abcde".split("\\w"); // => []

Again, every character in the string is a word character ( \\w ), so the result is only empty strings, which are omitted.

"abcde".split("\\w*?"); // => ["", "a", "b", "c", "d", "e"]

In this case, the * means "zero or more of the preceding item" ( \\w ), and because the quantifier is lazy it matches the empty string six times (once at the beginning of the string, once between each pair of characters, and once at the end). So we get the first empty string, then each character; the trailing empty string is omitted.

"abcde".split("\\w+?"); // => []

Here the + means "one or more of the preceding item" ( \\w ). Because the quantifier is lazy, each match consumes exactly one character, so the five characters are matched one at a time, resulting in only empty strings, which are omitted.

Try these examples again with input.split(regex, -1) and you should see all of the empty strings.
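A small sketch of that experiment follows. The class name `SplitKeepEmpty` and the helper `keepEmpties` are hypothetical; the two-argument `String.split(regex, limit)` with a negative limit is the documented way to keep trailing empty strings. One caveat: from Java 8 on, a zero-width match at the very start of the input no longer produces a leading empty string, so the \\w*? results on a current JDK may differ by one element from the output shown in the question.

```java
import java.util.Arrays;

public class SplitKeepEmpty {
    // Splits with a negative limit, so trailing empty strings are kept.
    static String[] keepEmpties(String input, String regex) {
        return input.split(regex, -1);
    }

    public static void main(String[] args) {
        String input = "abcde";
        for (String regex : new String[] {"[a-z]", "\\w", "\\w*?", "\\w+?"}) {
            String[] dropped = input.split(regex);        // limit 0: trailing empties removed
            String[] kept = keepEmpties(input, regex);    // limit -1: nothing removed
            System.out.println(regex + "  limit 0: " + Arrays.toString(dropped)
                    + "  limit -1: " + Arrays.toString(kept)
                    + "  (" + kept.length + " elements)");
        }
    }
}
```

For the three single-character patterns, the negative limit yields six empty strings: five matches cut the input into six pieces.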

String.split cuts the string at each match of the pattern:

The array returned by this method contains each substring of this string that is terminated by another substring that matches the given expression or is terminated by the end of the string.

So whenever a pattern like [a-z] is matched, the string is cut at that match. As every character in your string is matched by the pattern, the resulting array is empty (trailing empty strings are removed).

The same applies to \\w and \\w+? (one or more \\w, but as few repetitions as possible). That \\w*? gives the result you expected is due to the *? quantifier, which matches zero repetitions if possible, that is, an empty string. And an empty string is found at each position in the given string.
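To see what "as few repetitions as possible" means in practice, the sketch below compares the first match of the greedy and lazy forms on the same input. The class name `LazyVsGreedy` and the helper `firstMatch` are made up for this example; the greedy patterns \\w+ and \\w* are not part of the question and are included only for contrast.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LazyVsGreedy {
    // Returns the text of the first match, or null if the pattern does not match at all.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "abcde";
        System.out.println("\\w+  -> \"" + firstMatch("\\w+", input) + "\"");   // "abcde"
        System.out.println("\\w+? -> \"" + firstMatch("\\w+?", input) + "\"");  // "a"
        System.out.println("\\w*  -> \"" + firstMatch("\\w*", input) + "\"");   // "abcde"
        System.out.println("\\w*? -> \"" + firstMatch("\\w*?", input) + "\"");  // ""
    }
}
```

With nothing after the quantifier to force a longer match, the lazy forms stop at their minimum: one character for +? and zero characters for *?.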
