Tokenizing special characters in a string

I am working on some code and have run into a problem splitting certain characters out of a string. Given the string below, I can separate it into individual tokens:

String line = "hello world ; how are you ;";

such as hello, world, and ;

But when the code looks like:

String line2 = "hello world; how are you;";

I get tokens such as world; and you; when in reality I want the semicolon to be its own token. Thank you in advance for the help.

It is possible to split the second line on word boundaries, then trim each piece and drop the ones that contained only whitespace:

String line2 = "hello world; how are you;";

// split("\\b") yields pieces such as " " and "; ", so trim before filtering
String[] arr = Arrays.stream(line2.split("\\b"))
      .map(String::trim)
      .filter(s -> !s.isEmpty())
      .toArray(String[]::new);

System.out.println(Arrays.toString(arr));

Output:

[hello, world, ;, how, are, you, ;]
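
To see what the word-boundary split produces before any filtering, the following sketch prints each raw piece in brackets so the whitespace is visible. The class name WordBoundarySplitDemo and the helper rawPieces are illustrative only and not part of the original answer.

```java
import java.util.Arrays;
import java.util.List;

public class WordBoundarySplitDemo {
    // Hypothetical helper: returns the raw pieces produced by splitting
    // the input at every word boundary, with no trimming or filtering.
    static List<String> rawPieces(String input) {
        return Arrays.asList(input.split("\\b"));
    }

    public static void main(String[] args) {
        // Brackets make the whitespace inside each piece visible.
        for (String piece : rawPieces("hello world; how are you;")) {
            System.out.println("[" + piece + "]");
        }
    }
}
```

The piece that follows world is "; " with a trailing space, because there is no word boundary between the semicolon and the space after it. Pieces consisting of a single space appear between adjacent words.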

Another option is to match the tokens directly instead of splitting on a delimiter. A suitable regular expression is:
\w+|\S+ - at least one word character [0-9A-Za-z_] OR at least one non-space character. Note that Matcher.results() requires Java 9 or later:

String[] arr2 = Pattern.compile("\\w+|\\S+")
                      .matcher(line2)
                      .results()
                      .map(mr -> mr.group(0))
                      .toArray(String[]::new);

This produces [hello, world, ;, how, are, you, ;], with each semicolon as its own token and no surrounding whitespace.
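
As a self-contained sketch of the matching approach, the class below wraps it in a reusable method. It uses a variation of the pattern, \w+|[^\w\s], which is an assumption of this example and not the pattern from the answer. It emits every punctuation character as a separate token, whereas \w+|\S+ would keep a run such as ;; together as one token. The names PunctuationTokenizer and tokenize are hypothetical.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class PunctuationTokenizer {
    // A run of word characters, or exactly one character that is
    // neither a word character nor whitespace (assumed variation).
    private static final Pattern TOKEN = Pattern.compile("\\w+|[^\\w\\s]");

    // Hypothetical helper: collects every match of TOKEN in order.
    static List<String> tokenize(String input) {
        return TOKEN.matcher(input)
                .results()              // Matcher.results() needs Java 9+
                .map(mr -> mr.group())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(tokenize("hello world; how are you;"));
        System.out.println(tokenize("done;; ok"));
    }
}
```

For the two sample inputs this should print [hello, world, ;, how, are, you, ;] and [done, ;, ;, ok]. Whether consecutive punctuation should be split or kept together depends on the grammar being tokenized, so choose the pattern accordingly.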
