Using NIO vs RandomAccessFile to read chunks of files

I want to read a large text file (several GB) and process it without loading the whole file, instead loading it in chunks. (The processing involves counting word occurrences.)

I'm using a ConcurrentHashMap to process the file in parallel for efficiency. Is there a way to use NIO or RandomAccessFile to read it in chunks, and would that make it even more efficient?

The current implementation uses a buffered reader that goes something like this:

while(lines.size() <= numberOfLines && (line = bufferedReader.readLine()) != null) {
     lines.add(line);
}

lines.parallelStream().. // processing logic using ConcurrentHashMap
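For context, the loop above can be completed into a runnable sketch. Everything beyond the posted snippet — the chunk handling, the whitespace tokenizer, and the merge into the ConcurrentHashMap — is my guess at the elided processing logic, not the asker's actual code:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class ChunkedWordCount {

    // Read up to chunkSize lines at a time and merge word counts into a shared map.
    static ConcurrentMap<String, Long> countWords(BufferedReader reader, int chunkSize)
            throws IOException {
        ConcurrentMap<String, Long> counts = new ConcurrentHashMap<>();
        List<String> lines = new ArrayList<>(chunkSize);
        String line;
        while ((line = reader.readLine()) != null) {
            lines.add(line);
            if (lines.size() == chunkSize) {
                process(lines, counts);
                lines.clear();          // release this chunk before reading the next one
            }
        }
        process(lines, counts);         // last, possibly partial, chunk
        return counts;
    }

    // The "processing logic using ConcurrentHashMap" from the question, guessed at:
    // split each line on whitespace and count occurrences in parallel.
    static void process(List<String> lines, ConcurrentMap<String, Long> counts) {
        lines.parallelStream()
             .flatMap(l -> Arrays.stream(l.trim().split("\\s+")))
             .filter(w -> !w.isEmpty())
             .forEach(w -> counts.merge(w, 1L, Long::sum));
    }

    public static void main(String[] args) throws IOException {
        String text = "to be or not to be\nthat is the question";
        ConcurrentMap<String, Long> counts =
                countWords(new BufferedReader(new StringReader(text)), 1);
        System.out.println(counts.get("to")); // 2
        System.out.println(counts.get("be")); // 2
    }
}
```

Clearing the list between chunks is what keeps memory bounded; only one chunk of lines is alive at a time.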

RandomAccessFile only makes sense if you intend to "jump" around within the file, and your description of what you're doing doesn't sound like that. NIO makes sense if you have to cope with lots of parallel communication and want non-blocking operations, e.g. on sockets. That doesn't seem to be your use case either.

So my suggestion is to stick with the simple approach of a BufferedReader on top of an InputStreamReader(FileInputStream) (don't use FileReader, because it doesn't let you specify the charset/encoding) and go through the data as you showed in your sample code. Leave out the parallelStream at first; only try it if you actually see poor performance.
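A minimal sketch of that suggestion, assuming a UTF-8 file; the helper name and the line counting standing in for the real per-line processing are mine, not from the answer:

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReadWithCharset {

    // Stream the file line by line with an explicit charset,
    // which FileReader does not allow (before Java 11).
    static long countLines(Path file) throws IOException {
        long lineCount = 0;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file.toFile()),
                                      StandardCharsets.UTF_8))) {
            while (reader.readLine() != null) {
                lineCount++;            // process each line here instead of collecting them
            }
        }
        return lineCount;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("demo", ".txt");   // hypothetical demo input
        Files.write(file, "hello\nworld".getBytes(StandardCharsets.UTF_8));
        System.out.println(countLines(file));               // 2
        Files.delete(file);
    }
}
```

Processing each line as it is read, instead of accumulating a List first, keeps memory usage constant regardless of file size.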

Always remember: premature optimization is the root of all evil.

The obvious Java 7 solution is (though the reduce below needs Java 8's Stream API):

 String content = Files.readAllLines(Paths.get("file"), StandardCharsets.UTF_8)
         .stream()
         .reduce("", (a, b) -> a + b);

Honestly, I have no idea whether it is faster. Note, though, that readAllLines reads the entire file into a List in memory, so for a file of several GB it does not avoid loading the whole file.
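If avoiding a full in-memory load matters, Java 8's Files.lines is a closely related alternative that produces the lines lazily as the stream is consumed. A sketch under that assumption (the file and method names are mine, not from the answer):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class StreamWordCount {

    // Count word occurrences without holding the whole file in memory:
    // Files.lines reads lines lazily while the stream is being consumed.
    static Map<String, Long> countWords(Path file) throws IOException {
        try (Stream<String> lines = Files.lines(file, StandardCharsets.UTF_8)) {
            return lines.flatMap(line -> Arrays.stream(line.trim().split("\\s+")))
                        .filter(word -> !word.isEmpty())
                        .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("words", ".txt");  // hypothetical demo input
        Files.write(file, "to be or not to be".getBytes(StandardCharsets.UTF_8));
        System.out.println(countWords(file).get("be"));     // 2
        Files.delete(file);
    }
}
```

The try-with-resources around the stream matters: Files.lines keeps the file open until the stream is closed.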
