Java Split on Spaces and Special Characters
I am trying to split a string on spaces and some specific special characters.
Given the string "john - & + $ ? . @ boy" I want to get the array:
array[0]="john";
array[1]="boy";
I've tried several regular expressions and gotten nowhere. Here is my current stab:
String[] terms = uglString.split("\\s+|[\\-\\+\\$\\?\\.@&].*");
Which preserves "john" but not "boy". Can anyone get me the rest of this?
Just use:
String[] terms = input.split("[\\s@&.?$+-]+");
You can put a shorthand character class inside a character class (note the \\s), and most metacharacters lose their special meaning inside a character class, except for [, ], -, &, and \ (and ^ when it is the first character). However, & is meaningful only when it comes as the pair &&, and - is treated as a literal character if it is placed at the beginning or the end of the character class.
Other languages may have different rules for parsing the pattern, but the rule about - applies to most regex engines.
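To make the point about the hyphen concrete, here is a minimal sketch using the pattern from this answer. The class and method names (CharClassSplitDemo, splitTerms, splitTermsHyphenFirst) are made up for illustration and are not part of the original answer.

```java
import java.util.Arrays;

class CharClassSplitDemo {
    // Splits on runs of whitespace and the listed punctuation.
    // The hyphen is the last character in the class, so it is a literal,
    // not a range operator.
    static String[] splitTerms(String input) {
        return input.split("[\\s@&.?$+-]+");
    }

    // Same set of characters with the hyphen placed first; also a literal.
    static String[] splitTermsHyphenFirst(String input) {
        return input.split("[-\\s@&.?$+]+");
    }

    public static void main(String[] args) {
        String input = "john - & + $ ? . @ boy";
        System.out.println(Arrays.toString(splitTerms(input)));            // [john, boy]
        System.out.println(Arrays.toString(splitTermsHyphenFirst(input))); // [john, boy]
    }
}
```

One thing to watch for: if the input can start with a delimiter, String.split returns an empty string as the first element, so you may want to trim or filter the result.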
As @Sean Patrick Floyd mentioned in his answer, the important thing boils down to defining what constitutes a word. \w in Java (written "\\w" in a string literal) is equivalent to [a-zA-Z0-9_] (English letters in upper and lower case, digits, and underscore), and therefore \W matches every other character. If you want to consider Unicode letters and digits, you may want to look at Unicode character classes.
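As a rough illustration of the difference, the sketch below checks a word containing a non-ASCII letter against the default \w, against the Unicode categories \p{L} and \p{N}, and against \w with the Pattern.UNICODE_CHARACTER_CLASS flag. The class and method names are hypothetical and only serve the example.

```java
import java.util.regex.Pattern;

class WordCharDemo {
    // Default \w: ASCII letters, digits and underscore only.
    static boolean isAsciiWord(String s) {
        return s.matches("\\w+");
    }

    // \p{L} and \p{N} are the Unicode letter and number categories.
    static boolean isUnicodeWord(String s) {
        return s.matches("[\\p{L}\\p{N}_]+");
    }

    // UNICODE_CHARACTER_CLASS makes \w itself Unicode-aware.
    static boolean isWordWithFlag(String s) {
        return Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS)
                      .matcher(s)
                      .matches();
    }

    public static void main(String[] args) {
        String word = "caf\u00e9"; // "café" with a precomposed e-acute
        System.out.println(isAsciiWord(word));    // false
        System.out.println(isUnicodeWord(word));  // true
        System.out.println(isWordWithFlag(word)); // true
    }
}
```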
You could make your code much simpler by replacing your pattern with "\\W+" (one or more occurrences of a non-word character). This way you are whitelisting characters instead of blacklisting, which is usually a good idea.
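A small sketch of that suggestion, assuming the input from the question. NonWordSplitDemo and splitOnNonWord are illustrative names, not anything from the original answer.

```java
import java.util.Arrays;

class NonWordSplitDemo {
    // Splits on runs of anything that is not [a-zA-Z0-9_].
    static String[] splitOnNonWord(String input) {
        return input.split("\\W+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOnNonWord("john - & + $ ? . @ boy"))); // [john, boy]
        // Underscore counts as a word character, so it does not split here.
        System.out.println(Arrays.toString(splitOnNonWord("first_name last_name")));   // [first_name, last_name]
    }
}
```

Note that this treats every non-ASCII letter as a delimiter too, so it fits best when the words are known to be plain ASCII.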
And of course, things could be made more efficient by using Guava's Splitter class.
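Try this:
// replace() only matches the literal sequence "-&+$?.@"; replaceAll() with a character class replaces each character
input.replaceAll("[-&+$?.@]", " ").trim().split("\\s+");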
To add to what has been said about Splitter, you can do something of this sort:
String str = "john - & + $ ? . @ boy";
Iterable<String> ttt = Splitter.on(Pattern.compile("\\W")).trimResults().omitEmptyStrings().split(str);
Breaking it down step by step:
For your case, you remove the non-word characters (as pointed out), but keep the spaces so that the String is easy to split afterwards.
String ugly = "john - & + $ ? . @ boy";
String words = ugly.replaceAll("[^\\w\\s]", "");
There are a lot of spaces in the resulting String, which you will generally want to collapse to a single space:
String formatted = words.trim().replaceAll(" +", " ");
Now you can easily split the String into a String array of words:
String[] terms = formatted.split("\\s");
System.out.println(terms[0]);
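Put together, the three steps above can be sketched as one method. The names CleanThenSplitDemo and toTerms are hypothetical; the regular expressions are the ones from this answer.

```java
import java.util.Arrays;

class CleanThenSplitDemo {
    static String[] toTerms(String ugly) {
        // Step 1: drop everything that is neither a word character nor whitespace.
        String words = ugly.replaceAll("[^\\w\\s]", "");
        // Step 2: trim the ends and collapse runs of spaces to one space.
        String formatted = words.trim().replaceAll(" +", " ");
        // Step 3: split on the single remaining whitespace characters.
        return formatted.split("\\s");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(toTerms("john - & + $ ? . @ boy"))); // [john, boy]
    }
}
```

Two caveats with this approach: punctuation is deleted rather than treated as a separator, so "a.b" becomes the single term "ab", and " +" collapses only spaces, so input containing tabs or newlines would probably be better served by "\\s+" in steps 2 and 3.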
Use this format.
String s = "john - & + $ ? . @ boy";
String reg = "[!_.',@?&+$ -]+"; // lists every delimiter from the example; the trailing + collapses runs
String[] res = s.split(reg);
Include every character that you want to split on inside the [ ] brackets.
You can use something like the following:
arrayOfStringType = string.split(" |'|,|\\.|\\+|_");
'|' works as an OR (alternation) operator here. Note that . and + are regex metacharacters, so they have to be escaped to match literally.
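In an alternation, characters such as . and + keep their regex meaning unless escaped, which is easy to get wrong. The sketch below contrasts an unescaped and an escaped version; AlternationSplitDemo and its method names are made up for this example.

```java
import java.util.Arrays;

class AlternationSplitDemo {
    // Unescaped '.' matches any character, so every character becomes a
    // delimiter and split() returns no tokens at all.
    static String[] splitUnescaped(String s) {
        return s.split(" |'|,|.|_");
    }

    // Escaped '.' and '+' match only the literal characters.
    static String[] splitEscaped(String s) {
        return s.split(" |'|,|\\.|\\+|_");
    }

    public static void main(String[] args) {
        System.out.println(splitUnescaped("a.b,c").length);                // 0
        System.out.println(Arrays.toString(splitEscaped("a.b,c+d_e f")));  // [a, b, c, d, e, f]
    }
}
```

When the delimiters are all single characters, a character class such as "[ ',.+_]" is usually easier to read than an alternation, since most metacharacters need no escaping inside the brackets.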