
Text splitting with regular expressions in Java

I want to split text with a regex and then print each word to the console on its own line. The problem is that email@mail.org is not treated as a single word, and I don't know what the regex should be. I tried a look-ahead, but it didn't help. Should I use an additional if statement to detect such words, or just add something to my regex? Code:

Pattern p = Pattern.compile("\\s+[A-Za-z]++");
Matcher m = p.matcher(text);
while (m.find()) {
    String s = m.group().replaceAll("\\s++", "");
    System.out.println(s);
}

If all you want to do is isolate each word in your text and print it to the console, you can use String#split(String regex) and split on any amount of whitespace:

String[] words = text.split("\\s+");
for (String word : words) {
    System.out.println(word);
}

The logic here focuses on the whitespace that separates words, rather than on how each word itself should be matched.
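As a minimal, self-contained sketch of the whitespace approach, the class below splits a sample sentence and prints one token per line. The class name, the helper splitOnWhitespace, and the sample text are illustrative assumptions, not part of the original answer. The call to trim() is an addition: without it, text that begins with whitespace would yield an empty first element.

```java
public class WhitespaceSplitDemo {

    // Hypothetical helper: trim first so leading whitespace does not
    // produce an empty first element, then split on runs of whitespace.
    static String[] splitOnWhitespace(String text) {
        return text.trim().split("\\s+");
    }

    public static void main(String[] args) {
        // Sample input (an assumption for illustration).
        String text = "  Send it to email@mail.org today ";
        for (String word : splitOnWhitespace(text)) {
            // email@mail.org is printed as one token because it
            // contains no whitespace.
            System.out.println(word);
        }
    }
}
```

Because only whitespace acts as a separator, punctuation attached to a word (for example a trailing comma) stays part of that token.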

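If you want to split on anything that isn't a word character (a letter, digit, or underscore), for example on spaces and symbols, you could use the pattern below. Note that \W does not match digits; to split on numbers as well, so that only upper- and lowercase letters remain, use [^A-Za-z]+ instead.

String[] words = "some sentence".split("\\W+");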

This is essentially the reverse of what you were trying to do in your question: it specifies a blacklist of separator characters rather than a whitelist of allowed characters.
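The difference between \W (anything that is not a letter, digit, or underscore) and a strict "not a letter" class is easy to check with a small sketch. The class name, helper names, and sample sentence below are assumptions made for illustration.

```java
import java.util.Arrays;

public class NonWordSplitDemo {

    // Splits on runs of non-word characters; digits and underscores
    // count as word characters and therefore stay inside tokens.
    static String[] splitOnNonWord(String text) {
        return text.split("\\W+");
    }

    // Splits on runs of anything that is not an ASCII letter,
    // so digits act as separators too.
    static String[] splitOnNonLetters(String text) {
        return text.split("[^A-Za-z]+");
    }

    public static void main(String[] args) {
        String text = "the 12th of May, email@mail.org";

        System.out.println(Arrays.toString(splitOnNonWord(text)));
        // prints [the, 12th, of, May, email, mail, org]

        System.out.println(Arrays.toString(splitOnNonLetters(text)));
        // prints [the, th, of, May, email, mail, org]
    }
}
```

Either way, email@mail.org is broken into three pieces, because @ and . are separators under both patterns.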

If you still want tokens such as email@mail.org and 12th to count as words, you could split only on whitespace, optionally preceded by punctuation such as a sentence-ending character:

String[] words = "some sentence".split("([\\W\\s]*\\s+)");

This will split the following:

email@mail.org x becomes email@mail.org and x

hello world becomes hello and world

hello, world becomes hello and world

hello; world becomes hello and world

hello (world) becomes hello and (world), because the parentheses are not followed by whitespace

hello. World becomes hello and World

If the text starts with a separator, the first element is an empty string, so make sure you filter out empty components.
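A rough sketch for checking this pattern and filtering out empty components follows. The class name, the words helper, and the sample inputs are assumptions for illustration. One thing it makes visible: punctuation is removed only when whitespace follows it, so the parentheses in hello (world) stay attached to the token.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class PunctuationSplitDemo {

    // Hypothetical helper: split on whitespace together with any
    // punctuation directly before it, then drop empty strings
    // (for example, the one produced by leading whitespace).
    static List<String> words(String text) {
        return Arrays.stream(text.split("[\\W\\s]*\\s+"))
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(words("email@mail.org x"));  // prints [email@mail.org, x]
        System.out.println(words("hello, world"));      // prints [hello, world]
        System.out.println(words("hello. World"));      // prints [hello, World]
        System.out.println(words(" hello (world)"));    // prints [hello, (world)]
    }
}
```

If brackets or a final full stop should be removed as well, an extra clean-up step on each token would be needed; that is outside what this pattern does.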
