How do you scan in n number of words from a file at a time?
I have a really large text file (over 1 million words) and am trying to read it in sections to avoid excessive memory usage and to speed things up. I am trying to read in 10k words at a time, place the unique words from that section in an array, and then read the next 10k and do the same. I have worked out this so far:
while (scn.hasNext()) {                    // Check if there is anything in the file
    for (int i = 10000; i > 0; i--) {      // For the next 10000 strings,
        if (scn.hasNext()) {               // as long as the file doesn't end,
            fullBook.add(scn.next());      // add the word to the string I am working on.
        } else {
            break;
        }
    }
}
All of this would be wrapped in yet another while loop so that I can work with each string before reading in the next 10k. I figure there is a faster way, but I haven't found it yet. I have looked through Scanner and BufferedReader to see if I could find a method that reads only a set number of words, but I keep coming up empty. I don't mind learning a new method to do this, or just some trick to speed it up. Thanks in advance for the help!
Your code is no different from the single loop below.
while (scn.hasNext()) {
    fullBook.add(scn.next());
}
In fact, using two loops accomplishes nothing here. The Scanner's internal buffer is not affected by how you loop; its default size is 1024 characters, as you can see in the Scanner source.
Because I/O is slow, you may want to increase the buffer size so that the file is read less often. You can create your Scanner with the code below instead.
// Create a buffered reader with 1M buffer
Scanner scn = new Scanner (new BufferedReader(new FileReader(fileLocation), 1048576));
Let me know if there is a better way to do this.
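For the original goal of handling 10k words at a time and keeping only the unique words of each section, the inner loop becomes useful once each batch goes into its own set. The following is a minimal sketch, not a definitive implementation: the class name ChunkedWords and the helpers nextUniqueChunk and uniquePerChunk are hypothetical names chosen for illustration, and the 1 MB buffer mirrors the one suggested above. It takes a Reader so it can be tried on a short string; for a real file you would pass a FileReader.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Scanner;
import java.util.Set;

public class ChunkedWords {
    // Reads up to chunkSize words and returns the distinct ones in first-seen order.
    // Returns an empty set if the input is already exhausted.
    static Set<String> nextUniqueChunk(Scanner scn, int chunkSize) {
        Set<String> unique = new LinkedHashSet<>();
        for (int i = 0; i < chunkSize && scn.hasNext(); i++) {
            unique.add(scn.next());
        }
        return unique;
    }

    // Walks the whole input chunk by chunk and collects each chunk's unique words.
    // In real use you would process each set right away instead of storing them all.
    static List<Set<String>> uniquePerChunk(Reader source, int chunkSize) {
        List<Set<String>> chunks = new ArrayList<>();
        // 1 << 20 = 1,048,576 characters of read-ahead buffer (an assumption, tune as needed)
        try (Scanner scn = new Scanner(new BufferedReader(source, 1 << 20))) {
            while (scn.hasNext()) {
                chunks.add(nextUniqueChunk(scn, chunkSize));
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        Reader demo = new StringReader("a b a c b d d e");
        // Chunks of 3 words: "a b a", "c b d", "d e"
        System.out.println(uniquePerChunk(demo, 3)); // prints [[a, b], [c, b, d], [d, e]]
    }
}
```

With a chunk size of 10000 and a FileReader as the source, only one batch of words plus its set needs to be in memory at a time, provided each set is processed and discarded before the next one is read.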
Note: Scanner is NOT thread safe; @Alex recommends using a RandomAccessFile to bypass this issue.
Use a Thread, like this:
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class Parser implements UnitParserListener {

    public Parser(Scanner scanner) {
        // One worker per 10,000-word unit; all workers share the same Scanner
        for (int i = 0; i < 1_000_000; i += 10_000) {
            new UnitParser(scanner, this, i);
        }
    }

    public void unitCompleted(int startCount, String[] words) {
        // This method will be called once for each thread completion
    }

    private class UnitParser implements Runnable {
        private UnitParserListener listener;
        private Thread thread;
        private int startCount;
        private Scanner scanner;

        public UnitParser(Scanner scanner, UnitParserListener listener, int startCount) {
            this.scanner = scanner;
            this.startCount = startCount;
            this.listener = listener;
            // Start the thread
            thread = new Thread(this);
            thread.start();
        }

        public void run() {
            // You'll have to edit this to your liking
            List<String> results = new ArrayList<>();
            // Scanner is not thread safe, so read the whole unit while holding its lock.
            // Threads take the lock in no fixed order, so startCount identifies the
            // thread rather than a position in the file.
            synchronized (scanner) {
                for (int i = 0; i < 10_000 && scanner.hasNext(); i++) {
                    results.add(scanner.next());
                }
            }
            // Thread complete; it ends when run() returns, so no join() is needed here
            listener.unitCompleted(startCount, results.toArray(new String[0]));
        }
    }
}

interface UnitParserListener {
    // startCount will give us a way to identify the thread
    void unitCompleted(int startCount, String[] words);
}
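Since Scanner is not thread safe, one alternative to sharing it between threads is to let a single thread do all the reading and hand each finished batch of words to a pool of workers. This is a different technique from the listener approach above, sketched here under assumptions: the class name ChunkPipeline and the method process are hypothetical, the per-chunk work is reduced to building a sorted set of unique words, and the pool size is arbitrary.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ChunkPipeline {
    // Only the calling thread touches the Scanner; workers only see completed chunks.
    static List<Set<String>> process(Scanner scn, int chunkSize, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Set<String>>> pending = new ArrayList<>();
            while (scn.hasNext()) {
                List<String> chunk = new ArrayList<>(chunkSize);
                for (int i = 0; i < chunkSize && scn.hasNext(); i++) {
                    chunk.add(scn.next());
                }
                // Placeholder work: reduce the chunk to its sorted unique words
                Callable<Set<String>> task = () -> new TreeSet<>(chunk);
                pending.add(pool.submit(task));
            }
            // Futures are collected in submission order, so results follow file order
            List<Set<String>> results = new ArrayList<>();
            for (Future<Set<String>> f : pending) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("chunk processing failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        Scanner scn = new Scanner(new StringReader("a b a c b d d e"));
        System.out.println(process(scn, 3, 2)); // prints [[a, b], [b, c, d], [d, e]]
    }
}
```

Because reading stays on one thread, no locking around the Scanner is needed. The trade-off is that submitted chunks wait in the pool's queue if the workers are slower than the reader, so for very large files you may want to cap how many chunks are in flight.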