How do I split a text file into an ArrayList by all types of punctuation?

Here is my code so far:

import java.util.*;
import java.io.*;

public class Alice {

    public static void main(String[] args) throws IOException {

        /*
         * To put the text document into an ArrayList
         */
        Scanner newScanner = new Scanner(new File("ALICES ADVENTURES IN WONDERLAND.txt"));

        ArrayList<String> list = new ArrayList<String>();

        while (newScanner.hasNext()) {
            list.add(newScanner.next());
        }
        newScanner.close();
    }
}

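I still can't work out how to split the document on all punctuation, and I need to be able to perform String operations on the individual words in the text afterwards. Please help.

The input is the full text of Alice's Adventures in Wonderland, and I need output like the following: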

"This book is for use only, etc."

Basically, every word is separated and all punctuation is removed from the document.

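import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

List<String> list = new ArrayList<>();
// \w+ matches runs of letters, digits and underscores, so punctuation is skipped
Pattern wordPattern = Pattern.compile("\\w+");
try (BufferedReader reader = new BufferedReader(new FileReader("ALICES ADVENTURES IN WONDERLAND.txt"))) {
    String line;
    while ((line = reader.readLine()) != null) {
        Matcher matcher = wordPattern.matcher(line);
        while (matcher.find()) {
            list.add(matcher.group());
        }
    }
}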
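
To try the \w+ approach without the book file at hand, here is a minimal sketch that applies the same pattern to an in-memory string. The class name WordExtractor and the method extractWords are hypothetical names chosen for illustration. Note that \w does not match an apostrophe, so a word such as "Alice's" comes out as two tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordExtractor {
    // \w+ matches runs of letters, digits and underscores
    private static final Pattern WORD = Pattern.compile("\\w+");

    // Hypothetical helper: collects every \w+ match in the given text
    public static List<String> extractWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            words.add(matcher.group());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(extractWords("Alice's Adventures in Wonderland, by Lewis Carroll."));
        // prints [Alice, s, Adventures, in, Wonderland, by, Lewis, Carroll]
    }
}
```

Because the words end up in a plain List<String>, any String operation (toLowerCase, length, equals and so on) can be applied to each element afterwards.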

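You can use the regex \\p{Punct}. as the delimiter: the \p{Punct} character class followed by a dot, so each match is one punctuation character plus the single character after it (usually a space). The code below produces the output that follows.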

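import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

// Matches one punctuation character followed by any single character (typically the space after it)
String regex = "\\p{Punct}.";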
String phrase = "Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.";
Scanner scanner = new Scanner(phrase);
scanner.useDelimiter(Pattern.compile(regex));

List<String> list = new ArrayList<String>(); // <- Where possible, declare variables by their interface type (List rather than ArrayList)

while (scanner.hasNext()) {
    list.add(scanner.next());
}

list.forEach(System.out::println);
scanner.close();

Result

Lorem Ipsum is simply dummy text of the printing and typesetting industry
Lorem Ipsum has been the industry
 standard dummy text ever since the 1500s
when an unknown printer took a galley of type and scrambled it to make a type specimen book
It has survived not only five centuries
but also the leap into electronic typesetting
remaining essentially unchanged
It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages
and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
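
The output above is split into phrases rather than single words, and because the delimiter also consumes the character after each punctuation mark, the "s" of "industry's" is lost while the final period, which has nothing after it, stays in place. If individual words are wanted, one possible variation is to split on runs of punctuation or whitespace instead. This is only a sketch: the class name PunctuationSplitter and the method splitWords are hypothetical, and by default \p{Punct} in Java covers ASCII punctuation only, so typographic quotes in the book text would need extra handling.

```java
import java.util.ArrayList;
import java.util.List;

public class PunctuationSplitter {
    // Hypothetical helper: splits text into words on punctuation and whitespace
    public static List<String> splitWords(String text) {
        List<String> words = new ArrayList<>();
        // One or more ASCII punctuation or whitespace characters act as a single delimiter
        for (String token : text.split("[\\p{Punct}\\s]+")) {
            if (!token.isEmpty()) { // a leading delimiter yields an empty first token
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(splitWords("It has survived not only five centuries, but also the leap."));
        // prints [It, has, survived, not, only, five, centuries, but, also, the, leap]
    }
}
```

As with \w+, an apostrophe counts as a delimiter here, so "industry's" becomes "industry" and "s".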
