
Java Split on Spaces and Special Characters

I am trying to split a string on spaces and some specific special characters.

Given the string "john - & + $ ? . @ boy" I want to get the array:

array[0]="john";
array[1]="boy";

I've tried several regular expressions and gotten nowhere. Here is my current stab:

String[] terms = uglString.split("\\s+|[\\-\\+\\$\\?\\.@&].*");

Which preserves "john" but not "boy". Can anyone help me with the rest of this?

Just use:

String[] terms = input.split("[\\s@&.?$+-]+");

You can put a shorthand character class inside a character class (note the \s, written \\s in a Java string literal), and most metacharacters lose their special meaning inside a character class, except for [ , ] , - , & , ^ and \ . However, & is meaningful only when it comes as the pair && , ^ only when it is the first character, and - is treated as a literal character if it is placed at the beginning or the end of the character class.

Other languages may have different rules for parsing the pattern, but the rule about - applies to most engines.
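As a minimal sketch of the character-class approach described above (the class name CharClassSplit and the method name splitTerms are made up for illustration):

```java
import java.util.Arrays;

class CharClassSplit {
    // Splits on runs of whitespace or any of the characters @ & . ? $ + -
    static String[] splitTerms(String input) {
        // '-' is the last character in the class, so it is a literal, not a range operator.
        // The trailing '+' (outside the brackets) merges adjacent delimiters into one match.
        return input.split("[\\s@&.?$+-]+");
    }

    public static void main(String[] args) {
        String[] terms = splitTerms("john - & + $ ? . @ boy");
        System.out.println(Arrays.toString(terms)); // prints [john, boy]
    }
}
```

Because the whole run of spaces and symbols between the two words is consumed as a single delimiter, no empty strings appear in the middle of the result.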

As @Sean Patrick Floyd mentioned in his answer, the important thing boils down to defining what constitutes a word. By default, \w in Java is equivalent to [a-zA-Z0-9_] (upper- and lower-case English letters, digits and underscore), and therefore \W consists of all other characters. If you want to consider Unicode letters and digits, you may want to look at Unicode character classes.
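To illustrate the difference, here is a small sketch that compares the default \W with a delimiter built from the Unicode categories \p{L} (letters) and \p{N}  (numbers). The class and method names are hypothetical, and the sample input is an assumption chosen to contain accented letters:

```java
import java.util.Arrays;

class UnicodeWordSplit {
    // ASCII-only notion of a word: anything outside [a-zA-Z0-9_] is a delimiter.
    static String[] splitAscii(String input) {
        return input.split("\\W+");
    }

    // Unicode-aware notion of a word: anything that is not a letter or number is a delimiter.
    static String[] splitUnicode(String input) {
        return input.split("[^\\p{L}\\p{N}]+");
    }

    public static void main(String[] args) {
        // "josé - & müller", written with Unicode escapes to avoid source-encoding issues
        String input = "jos\u00e9 - & m\u00fcller";
        System.out.println(Arrays.toString(splitAscii(input)));   // accented letters act as delimiters
        System.out.println(Arrays.toString(splitUnicode(input))); // accented letters stay inside words
    }
}
```

With the default \W, the accented letters are treated as separators and the words are cut apart; the Unicode-aware class keeps them intact. Note that, unlike \w, this particular class also treats the underscore as a delimiter.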

You could make your code much simpler by replacing your pattern with "\\W+" (one or more occurrences of a non-word character). This way you are whitelisting the characters that make up a word instead of blacklisting delimiters, which is usually a good idea.

And of course, things could be made more efficient by using Guava's Splitter class.
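A short sketch of the "\\W+" variant follows (NonWordSplit and splitWords are illustrative names, not from the original answer). It also shows one caveat worth knowing: if the input starts with a delimiter, String.split returns an empty string as the first element.

```java
import java.util.Arrays;

class NonWordSplit {
    // Treats every run of non-word characters (anything outside [a-zA-Z0-9_]) as one delimiter.
    static String[] splitWords(String input) {
        return input.split("\\W+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitWords("john - & + $ ? . @ boy"))); // prints [john, boy]

        // A leading delimiter yields an empty first element; trailing empties are dropped.
        String[] withLeading = splitWords("  john & boy  ");
        System.out.println(withLeading.length);       // prints 3
        System.out.println(withLeading[0].isEmpty()); // prints true
    }
}
```

Trimming the input first, or filtering out empty elements afterwards, avoids the leading empty string.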

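Try this. Note that replace() treats its first argument as a literal string, so replaceAll() with a character class is needed to match each special character individually:

input.replaceAll("[-&+$?.@]", " ").trim().split("\\s+");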

To add to what has been said about Splitter, you can do something like this:

    String str = "john - & + $ ? . @ boy";
    Iterable<String> ttt = Splitter.on(Pattern.compile("\\W")).trimResults().omitEmptyStrings().split(str);

Breaking it down step by step:

In your case, you remove the non-word characters (as already pointed out). You might want to preserve the spaces at this stage so that the String is easy to split later.

String ugly = "john - & + $ ? . @ boy";
String words = ugly.replaceAll("[^\\w\\s]", "");

The resulting String contains a lot of spaces, which you will generally want to collapse into a single space:

String formatted = words.trim().replaceAll(" +", " ");

Now you can easily split the String into a String array of words:

String[] terms = formatted.split("\\s");
System.out.println(terms[0]);
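The three steps above can be combined into one helper. This is only a sketch (the names StepwiseSplit and cleanAndSplit are made up), and it highlights a side effect of deleting the symbols outright: words that are joined only by a symbol are merged rather than separated.

```java
import java.util.Arrays;

class StepwiseSplit {
    static String[] cleanAndSplit(String ugly) {
        // Step 1: delete everything that is neither a word character nor whitespace.
        String words = ugly.replaceAll("[^\\w\\s]", "");
        // Step 2: trim the ends and collapse runs of spaces into a single space.
        String formatted = words.trim().replaceAll(" +", " ");
        // Step 3: split on the remaining single whitespace characters.
        return formatted.split("\\s");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(cleanAndSplit("john - & + $ ? . @ boy"))); // prints [john, boy]
        // No whitespace around '&', so deleting it glues the two words together.
        System.out.println(Arrays.toString(cleanAndSplit("rock&roll"))); // prints [rockroll]
    }
}
```

If symbols should always act as separators, replacing them with a space in step 1 instead of an empty string avoids the merging.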

Use this format.

String s = "john - & + $ ? . @ boy";
String reg = "[!_.',@? ]";
String[] res = s.split(reg);

Here, include every character that you want to split on inside the [ ] brackets.

You can use something like the following:

arrayOfStringType=string.split(" |'|,|\\.|\\+|_");

The | works as an OR operator here. Note that . and + are regex metacharacters outside a character class, so they must be escaped as \\. and \\+ to match literally.
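As a hedged illustration of the alternation approach (class and method names are hypothetical), the sketch below compares an escaped alternation with the equivalent character class, and shows what happens when . is left unescaped:

```java
import java.util.Arrays;

class AlternationSplit {
    // Alternation: '.' and '+' must be escaped to be matched literally.
    static String[] splitWithAlternation(String input) {
        return input.split(" |'|,|\\.|\\+|_");
    }

    // Equivalent character class: '.' and '+' need no escaping inside [ ].
    static String[] splitWithCharClass(String input) {
        return input.split("[ ',.+_]");
    }

    public static void main(String[] args) {
        String input = "a,b.c+d_e f'g";
        System.out.println(Arrays.toString(splitWithAlternation(input))); // prints [a, b, c, d, e, f, g]
        System.out.println(Arrays.toString(splitWithCharClass(input)));   // prints [a, b, c, d, e, f, g]

        // An unescaped '.' matches every character, so every piece is empty and gets dropped.
        System.out.println("a.b".split(".").length); // prints 0
    }
}
```

For single-character delimiters the character class is usually the shorter and less error-prone form.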


 