How do you scan in n words from a file at a time?

I have a very large text file (over 1 million words) and am trying to read it in sections to avoid excessive memory usage and to speed things up. I am trying to read in 10k words at a time, place the unique words from that section in an array, then read the next 10k and do the same. I have worked out this so far:

while (scn.hasNext()) {                    // Check if there is anything left in the file
    for (int i = 10000; i > 0; i--) {      // For the next 10000 strings,
        if (scn.hasNext()) {               // as long as the file doesn't end,
            fullBook.add(scn.next());      // add the word to the collection I am working on.
        }
        else {
            break;
        }
    }
}

All of this would be wrapped in yet another while loop so that I can work with each section before reading in the next 10k words. I figure there is a faster way, but I haven't found it yet. I have looked through Scanner and BufferedReader for a method that reads only a set number of words, but I keep coming up empty. I don't mind learning a new method to do this, or just some trick to speed it up. Thanks in advance for the help!

Your code behaves the same as the single loop below.

while (scn.hasNext()) {
    fullBook.add(scn.next());
}

In fact, using two loops serves no purpose here. The Scanner's internal buffer is not affected by how you loop: it starts at 1024 characters, as you can see in the source of Scanner.

Because I/O is slow, you may want to increase the buffer size so that the file is read less frequently. You can change how your Scanner is created, as in the code below.

// Create a buffered reader with a 1M-character buffer (1,048,576 chars)
Scanner scn = new Scanner(new BufferedReader(new FileReader(fileLocation), 1048576));
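
To tie this back to the question, here is a minimal sketch of reading a fixed number of words per section through a buffered Scanner and keeping only the unique words of each section. The class name `ChunkedWordReader` and the helper `nextUniqueWords` are hypothetical names made up for illustration, and a `LinkedHashSet` stands in for the "array of unique words" the question mentions.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.LinkedHashSet;
import java.util.Scanner;
import java.util.Set;

class ChunkedWordReader {

    /**
     * Reads up to chunkSize tokens from the scanner and returns the
     * distinct ones in first-seen order (hypothetical helper).
     */
    static Set<String> nextUniqueWords(Scanner scn, int chunkSize) {
        Set<String> unique = new LinkedHashSet<>();
        for (int i = 0; i < chunkSize && scn.hasNext(); i++) {
            unique.add(scn.next()); // duplicates are silently ignored by the set
        }
        return unique;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: args[0] is the path to the text file
        try (Scanner scn = new Scanner(
                new BufferedReader(new FileReader(args[0]), 1048576))) {
            while (scn.hasNext()) {
                Set<String> unique = nextUniqueWords(scn, 10_000);
                // Work with this section here before the next one is read
                System.out.println(unique.size() + " unique words in this section");
            }
        }
    }
}
```

Only one section's worth of unique words is held at a time, so memory use is bounded by the section size rather than the file size.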

Let me know if there is a better way to do this.

Note: Scanner is NOT thread safe. @Alex recommends using a RandomAccessFile to bypass this issue.

Use a Thread, like this:

import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class Parser implements UnitParserListener {

    private static final int UNIT_SIZE = 10_000;

    public Parser(Scanner scanner) {
        // One worker per 10,000-word unit, for a file of roughly 1,000,000 words
        for (int i = 0; i < 1_000_000; i += UNIT_SIZE) {
            new UnitParser(scanner, this, i);
        }
    }

    @Override
    public void unitCompleted(int startCount, String[] words) {
        // This method will be called once for each thread completion,
        // on the worker's own thread
    }

    private class UnitParser implements Runnable {

        private final UnitParserListener listener;
        private final Thread thread;
        private final int startCount;
        private final Scanner scanner;

        public UnitParser(Scanner scanner, UnitParserListener listener, int startCount) {
            this.scanner = scanner;
            this.startCount = startCount;
            this.listener = listener;

            // Start the thread
            thread = new Thread(this);
            thread.start();
        }

        @Override
        public void run() {
            // You'll have to edit this to your liking
            List<String> results = new ArrayList<>();
            // Scanner is not thread safe, so only one worker may read at a time.
            // Workers take the lock in no fixed order, so startCount identifies
            // the worker rather than an exact position in the file.
            synchronized (scanner) {
                for (int i = 0; i < UNIT_SIZE && scanner.hasNext(); i++) {
                    results.add(scanner.next());
                }
            }

            // Unit complete; the thread stops on its own when run() returns
            listener.unitCompleted(startCount, results.toArray(new String[0]));
        }
    }
}

interface UnitParserListener {
    // startCount will give us a way to identify the thread
    void unitCompleted(int startCount, String[] words);
}
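
Since Scanner is not thread safe, another way to get parallelism is to keep all reading on one thread and hand each finished section to a thread pool. The sketch below swaps the raw threads and listener above for an `ExecutorService`; this is a different technique from the one in the answer, not a restatement of it. The class name `PooledWordCounter`, the method `uniqueWordsPerChunk`, and the `threads` parameter are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Scanner;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PooledWordCounter {

    /**
     * Reads chunkSize words at a time on the calling thread and
     * de-duplicates each chunk on a worker pool (hypothetical helper).
     */
    static List<Set<String>> uniqueWordsPerChunk(Scanner scn, int chunkSize, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Set<String>>> pending = new ArrayList<>();
            while (scn.hasNext()) {
                // Only this thread touches the Scanner, so no locking is needed
                List<String> chunk = new ArrayList<>(chunkSize);
                for (int i = 0; i < chunkSize && scn.hasNext(); i++) {
                    chunk.add(scn.next());
                }
                // The chunk is never modified after it is handed to the pool
                Callable<Set<String>> task = () -> new HashSet<>(chunk);
                pending.add(pool.submit(task));
            }
            List<Set<String>> results = new ArrayList<>();
            for (Future<Set<String>> future : pending) {
                results.add(future.get()); // collected in the order the chunks were read
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

Note that this version queues every chunk before collecting results, so it trades memory for simplicity. For a very large file you would want to bound the number of pending chunks.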
