Java - How to read a large file word by word instead of line by line?

Question

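I want to read the "text8" corpus in Java and reformat some of the words. The problem is that in this 100 MB corpus, all the words are on a single line. So if I try to load it with BufferedReader and readLine, it takes up too much memory at once, and I cannot process it to split all the words into a list/array.

So my question is: is it possible in Java to read the corpus word by word instead of line by line? For example, since all the words are on one line, could I read 100 words per iteration?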

Answer 1

You can try using a Scanner and setting the delimiter to whatever suits you:

Scanner input=new Scanner(myFile);
input.useDelimiter(" +"); // delimiter is one or more spaces

while(input.hasNext()){
  System.out.println(input.next());
}
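To get the "100 words per iteration" behaviour the question asks about, the same Scanner loop can collect tokens into fixed-size batches. The following is only a minimal sketch: the class name WordBatches, the method forEachBatch and its parameters are made up for illustration, and it assumes that splitting on any run of whitespace is acceptable for the corpus.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.function.Consumer;

class WordBatches {

    // Hypothetical helper: reads whitespace-separated words from `source` and
    // passes them to `handler` in lists of at most `batchSize` words.
    // The caller stays responsible for closing `source`.
    static void forEachBatch(Readable source, int batchSize, Consumer<List<String>> handler) {
        Scanner input = new Scanner(source);
        input.useDelimiter("\\s+"); // one or more whitespace characters
        List<String> batch = new ArrayList<>(batchSize);
        while (input.hasNext()) {
            batch.add(input.next());
            if (batch.size() == batchSize) {
                handler.accept(batch);
                batch = new ArrayList<>(batchSize); // start a fresh batch
            }
        }
        if (!batch.isEmpty()) {
            handler.accept(batch); // last, possibly shorter, batch
        }
    }

    public static void main(String[] args) {
        // Small in-memory demo; prints [a, b], then [c, d], then [e]
        forEachBatch(new StringReader("a b c d e"), 2, batch -> System.out.println(batch));
    }
}
```

For the real corpus you would pass something like Files.newBufferedReader(Paths.get("text8")) inside a try-with-resources block and use a batch size of 100. Only the current batch and the Scanner's internal buffer are held in memory at any time.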

Answer 2

I would suggest using a "character stream" with FileReader.

Here is example code from http://www.tutorialspoint.com/java/java_files_io.htm:

import java.io.*;

public class CopyFile {
   public static void main(String args[]) throws IOException
   {
      FileReader in = null;
      FileWriter out = null;

      try {
         in = new FileReader("input.txt");
         out = new FileWriter("output.txt");

         int c;
         while ((c = in.read()) != -1) {
            out.write(c);
         }
      }finally {
         if (in != null) {
            in.close();
         }
         if (out != null) {
            out.close();
         }
      }
   }
}

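Each call to read() returns a single 16-bit Unicode character (one Java char), so it does not matter whether your text is all on one line.

Since you want to go word by word, you can simply keep reading until you hit a space, at which point you have your word.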
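The "read until you hit a space" idea could look roughly like the sketch below. It is not part of the tutorial code above: the class WordReader and the method nextWord are hypothetical names, and treating every whitespace character (not only a plain space) as a separator is an assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class WordReader {

    // Hypothetical helper: returns the next whitespace-delimited word from `in`,
    // or null once the end of the input has been reached.
    static String nextWord(Reader in) throws IOException {
        StringBuilder word = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            if (Character.isWhitespace(c)) {
                if (word.length() > 0) {
                    return word.toString(); // whitespace ends the current word
                }
                // otherwise we are still skipping leading whitespace
            } else {
                word.append((char) c);
            }
        }
        // End of input: return the trailing word if there is one
        return word.length() > 0 ? word.toString() : null;
    }

    public static void main(String[] args) throws IOException {
        // In-memory demo; for a file, wrap a FileReader in the BufferedReader instead
        try (Reader in = new BufferedReader(new StringReader("anarchism originated as a term"))) {
            String word;
            while ((word = nextWord(in)) != null) {
                System.out.println(word);
            }
        }
    }
}
```

Wrapping the FileReader in a BufferedReader matters here, because calling read() one character at a time on an unbuffered reader is slow for a 100 MB file.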

Answer 3

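Use the next method of java.util.Scanner.

The next method finds and returns the next complete token from this scanner. A complete token is preceded and followed by input that matches the delimiter pattern. This method may block while waiting for input to scan, even if a previous call to Scanner.hasNext returned true.

Example: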

public static void main(String[] args) {
        Scanner sc = new Scanner (System.in); 
        String a = sc.next();
        String b = sc.next();
        System.out.println("First Word: "+a);
        System.out.println("Second Word: "+b);
        sc.close();
    }

Input:

Hello Stackoverflow

Output:

First Word: Hello

Second Word: Stackoverflow

In your case, use a Scanner to read the file, then call scannerobject.next() to read each token (word).
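Applied to the original task (reformatting words from a large file), this might be sketched as follows. The class WordRewriter, the method rewrite and the upper-casing step are placeholders made up for illustration, not something from the question. The sketch relies on Scanner's default delimiter, which is whitespace.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Scanner;
import java.util.function.UnaryOperator;

class WordRewriter {

    // Hypothetical helper: streams words from `source`, applies `reformat` to each,
    // and writes the results to `out`, one word per line. Returns the word count.
    static long rewrite(Readable source, Writer out, UnaryOperator<String> reformat)
            throws IOException {
        Scanner sc = new Scanner(source); // default delimiter is whitespace
        long count = 0;
        while (sc.hasNext()) {
            out.write(reformat.apply(sc.next()));
            out.write('\n');
            count++;
        }
        // Scanner swallows read errors from the source, so surface them explicitly
        IOException readError = sc.ioException();
        if (readError != null) {
            throw readError;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // In-memory demo with a placeholder transformation (upper-casing)
        StringWriter out = new StringWriter();
        long n = rewrite(new StringReader("Hello Stackoverflow"), out, String::toUpperCase);
        System.out.print(out);
        System.out.println(n + " words");
    }
}
```

For a real file you would pass readers and writers obtained from Files.newBufferedReader and Files.newBufferedWriter, so that each word is written out as soon as it has been processed and the whole corpus never has to sit in memory.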

Answer 4

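    // Requires Apache POI (poi-ooxml) plus imports for FileInputStream, Scanner,
    // OPCPackage, ZipSecureFile, XWPFDocument and XWPFWordExtractor.
    try (FileInputStream fis = new FileInputStream("Example.docx")) {
        // Lower the minimum inflate ratio so highly compressed documents are not rejected
        ZipSecureFile.setMinInflateRatio(0.009);
        XWPFDocument file = new XWPFDocument(OPCPackage.open(fis));
        XWPFWordExtractor ext = new XWPFWordExtractor(file);
        // Note: getText() returns the whole document text as a single String
        Scanner scanner = new Scanner(ext.getText());
        while (scanner.hasNextLine()) {
            // Split each line on single spaces and print the words one by one
            String[] value = scanner.nextLine().split(" ");
            for (String v : value) {
                System.out.println(v);
            }
        }
    } catch (Exception e) {
        System.out.println(e);
    }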

4 answers

Answer 1
6 2015-11-04 10:32:21

Answer 2
2 2015-11-04 10:28:28

Answer 3
1 2015-11-04 10:41:39

Answer 4
-1 2020-01-09 14:12:43
