
Confused with using split with multiple delimiters

I'm practicing reading input and then tokenizing it. For example, if I have [882,337], I want to get just the numbers 882 and 337. I tried the following code:

    String test = "[882,337]";
    String[] tokens = test.split("\\[|\\]|,");
    System.out.println(tokens[0]);
    System.out.println(tokens[1]);
    System.out.println(tokens[2]);

It kind of works; the output is: (blank line) 882 337

What I don't understand is why tokens[0] is empty. I would expect only two tokens, with tokens[0] = 882 and tokens[1] = 337.

I checked out some links but didn't find the answer.

Thanks for the help!

Split splits the given String. If you split "[882,337]" on "[" or "," or "]" then you actually have:

  • nothing
  • 882
  • 337
  • nothing

But, as you have called String.split(delimiter), this calls String.split(delimiter, limit) with a limit of zero.

From the documentation:

The limit parameter controls the number of times the pattern is applied and therefore affects the length of the resulting array. If the limit n is greater than zero then the pattern will be applied at most n - 1 times, the array's length will be no greater than n, and the array's last entry will contain all input beyond the last matched delimiter. If n is non-positive then the pattern will be applied as many times as possible and the array can have any length. If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded.

(emphasis mine)

So in this configuration the trailing empty strings are discarded, while the leading empty string is kept. You are therefore left with exactly the three tokens you see.
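To make the effect of the limit concrete, here is a minimal sketch that runs the same pattern with a limit of 0 and a limit of -1. The class and method names (SplitLimitDemo, splitWithLimit) are made up for illustration.

```java
import java.util.Arrays;

public class SplitLimitDemo {
    // Splits on '[', ']' or ',' using the given limit (hypothetical helper).
    static String[] splitWithLimit(String input, int limit) {
        return input.split("\\[|\\]|,", limit);
    }

    public static void main(String[] args) {
        String test = "[882,337]";
        // Limit 0 is what the one-argument split uses: trailing empty strings are dropped.
        System.out.println(Arrays.toString(splitWithLimit(test, 0)));  // [, 882, 337]
        // A negative limit keeps trailing empty strings.
        System.out.println(Arrays.toString(splitWithLimit(test, -1))); // [, 882, 337, ]
    }
}
```

In both cases the leading empty string stays, because only trailing empty strings are affected by the limit.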


Usually, to tokenize something like this, one would go for a combination of replaceAll and split:

final String[] tokens = input.replaceAll("^\\[|\\]$", "").split(",");

This will first strip off the leading ( ^\\[ ) and trailing ( \\]$ ) brackets and then split on "," . This way you don't need somewhat obtuse program logic where you start looping from an arbitrary index.
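A runnable sketch of this approach, for reference. Note that replaceAll takes the replacement as its second argument, here the empty string. The class and method names (BracketStrip, tokenize) are only illustrative.

```java
import java.util.Arrays;

public class BracketStrip {
    // Removes a leading '[' and a trailing ']' if present, then splits on commas.
    static String[] tokenize(String input) {
        return input.replaceAll("^\\[|\\]$", "").split(",");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokenize("[882,337]"))); // [882, 337]
    }
}
```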


As an alternative, for more complex tokenizations, one can use Pattern. It might be overkill here, but it is worth bearing in mind before you start writing long replaceAll chains.

First we need to define, in regex, the tokens we want (rather than those we're splitting on). In this case it's simple: just digits, so \\d .

So, in order to extract all digit-only values (no thousands/decimal separators) from an arbitrary String, one would do the following:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final List<Integer> tokens = new ArrayList<>();    // holds the parsed tokens
final Pattern pattern = Pattern.compile("\\d++");  // the compiled regex
final Matcher matcher = pattern.matcher(input);    // the matcher on input

while (matcher.find()) {                           // for each matched token
    tokens.add(Integer.parseInt(matcher.group())); // parse as an int and store
}

NB: I have used a possessive regex pattern for efficiency.

So, you see, the above code is somewhat more complex than the simple replaceAll().split(), but it is much more extensible. You can use an arbitrarily complex regex to tokenize almost any input.
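For anyone who wants to try it directly, the matcher loop can be wrapped in a small self-contained class along these lines. The names DigitTokens and extract are invented for this sketch, and it assumes every run of digits fits in an int.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DigitTokens {
    // One or more digits, matched possessively.
    private static final Pattern DIGITS = Pattern.compile("\\d++");

    // Returns every run of digits in the input as an Integer.
    // Integer.parseInt throws NumberFormatException if a run exceeds the int range.
    static List<Integer> extract(String input) {
        List<Integer> tokens = new ArrayList<>();
        Matcher matcher = DIGITS.matcher(input);
        while (matcher.find()) {
            tokens.add(Integer.parseInt(matcher.group()));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(extract("[882,337]")); // [882, 337]
    }
}
```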

The symbols where the string is split are here:

String test = "[882,337]";
               ^   ^   ^

Because the first char matches your delimiter, everything to the left of it becomes the first result. Well, to the left of the first character there is nothing, so the result is the empty string.

One could expect the same behaviour for the end, since the last symbol also matches the delimiter. But:

Trailing empty strings are therefore not included in the resulting array.

See Javadoc.

That's because each delimiter has a "before" and an "after" result, even if it is empty. Consider

882,337

You expect that to produce two results. Similarly, you expect

882,337,

to produce three, with the last one being empty (assuming your limit is big enough, or assuming you're using almost any other language / implementation of split()). Extending that logically,

,882,337,

must produce four, with the first and last results being empty. This is exactly the case you have, except you have multiple delimiters.
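The counts above can be checked in Java by passing a negative limit, so that trailing empty strings are kept. This is a small sketch; DelimiterCount and count are hypothetical names.

```java
public class DelimiterCount {
    // Number of pieces produced by splitting on ',' while keeping trailing empty strings.
    static int count(String input) {
        return input.split(",", -1).length;
    }

    public static void main(String[] args) {
        System.out.println(count("882,337"));   // 2
        System.out.println(count("882,337,"));  // 3
        System.out.println(count(",882,337,")); // 4
    }
}
```

With the default one-argument split the same inputs give 2, 2 and 3 pieces, because the trailing empty string is dropped.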

Splitting creates two (or more) things from one thing. For instance, if you split a,b on "," you will get a and b.

But in the case of ",b" you will get "" and "b". You can think of it this way: "" exists at the start, at the end and even in between all characters of a string:

""+","+"b" -> ",b" so if we split on this "," we get the left and right parts: "" and "b".


Something similar happens in the case of "a,": at first the result array is ["a",""], but here the split method removes trailing empty strings and returns only ["a"] (you can turn off this clearing mechanism by using split(",", -1)).

So in the case of

String test = "[882,337]";
String[] tokens = test.split("\\[|\\]|,");

you are splitting:

     ""+"["+"882"+","+"337"+"]"+""
here:    ^         ^         ^

which at first creates the array ["", "882", "337", ""], but then the trailing empty string is removed and you finally receive:

["", "882", "337"]

The only case where an empty string is removed from the start of the result array is when
