
Quickest way to read text-file line by line in Java

For log processing my application needs to read text files line by line. First I used the readLine() function of BufferedReader, but I read on the internet that BufferedReader is slow when reading files.
Afterwards I tried to use FileInputStream together with a FileChannel and a MappedByteBuffer, but in that case there is no function similar to readLine(), so I search the text for line breaks myself and process each line:

    try {
        FileInputStream f = new FileInputStream(file);
        FileChannel ch = f.getChannel();
        // Map the whole file into memory and scan it byte by byte.
        MappedByteBuffer mb = ch.map(FileChannel.MapMode.READ_ONLY, 0L, ch.size());
        byte[] bytes = new byte[1024];
        int i = 0;
        while (mb.hasRemaining()) {
            byte get = mb.get();
            if (get == '\n') {
                // End of line: run the automaton over the buffered bytes.
                if (ra.run(new String(bytes)))
                    cnt++;
                // Clear the buffer for the next line.
                for (int j = 0; j <= i; j++)
                    bytes[j] = 0;
                i = 0;
            } else {
                bytes[i++] = get;
            }
        }
    } catch (Exception ex) {
        ex.printStackTrace();
    }

I know this is probably not a good way to implement it, but when I just read the text file as bytes it is about 3 times faster than using BufferedReader; however, calling new String(bytes) creates a new String and makes the program even slower than with a BufferedReader.
So I wanted to ask: what is the fastest way to read a text file line by line? Some say BufferedReader is the only solution to this problem.

PS: ra is an instance of RunAutomaton from the dk.brics.Automaton library.

I very much doubt that BufferedReader is going to cause a significant overhead. Adding your own code is likely to be at least as inefficient, and quite possibly wrong too.

For example, in the code that you've given you're calling new String(bytes), which is always going to create a string from 1024 bytes, using the platform default encoding... not a good idea. Sure, you clear the array afterwards, but your strings are still going to contain a bunch of '\0' characters, which means a lot of wasted space, apart from anything else. You should at least restrict the portion of the byte array the string is being created from (which also means you don't need to clear the array afterwards).
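For illustration, here is a minimal sketch of that suggestion, assuming UTF-8 input and reusing the bytes buffer and the i counter from the question's code (the helper name is made up):

    import java.nio.charset.StandardCharsets;

    // Convert only the bytes that belong to the current line, with an explicit
    // charset instead of the platform default. No clearing of the buffer is needed.
    static String lineToString(byte[] buffer, int length) {
        return new String(buffer, 0, length, StandardCharsets.UTF_8);
    }

    // In the question's loop this would be called as: lineToString(bytes, i)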

Have you actually tried using BufferedReader and found it to be too slow? You should usually write the simplest code which will meet your goals first, and then check whether it's fast enough... especially if your only reason for not doing so is an unspecified resource you "read on the internet". Do you want me to find hundreds of examples of people spouting incorrect performance suggestions? :)

As an alternative, you might want to look at Guava's overload of Files.readLines() which takes a LineProcessor.
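As a rough sketch of how that overload could be used, assuming the file is UTF-8 and reusing the RunAutomaton from the question (the method name is made up):

    import java.io.File;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;

    import com.google.common.io.Files;
    import com.google.common.io.LineProcessor;

    import dk.brics.automaton.RunAutomaton;

    // Counts matching lines; Guava feeds each line to the LineProcessor,
    // so the whole file never has to be held in memory as a List.
    static int countMatches(File file, final RunAutomaton ra) throws IOException {
        return Files.readLines(file, StandardCharsets.UTF_8, new LineProcessor<Integer>() {
            private int count = 0;

            @Override
            public boolean processLine(String line) {
                if (ra.run(line)) {
                    count++;
                }
                return true; // keep reading until end of file
            }

            @Override
            public Integer getResult() {
                return count;
            }
        });
    }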

Using a plain BufferedReader I got 100+ MB/s. It is highly likely that the speed at which you can read the data from disk is your bottleneck, so how you do the reading won't make much difference.

BufferedReader is not the only solution, but it is fast enough for 99% of use cases, so why make things more complicated than they need to be?
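As a rough sketch of that kind of measurement, assuming Java 7+ for try-with-resources (the class name is made up; the byte count is approximated from the character count, so the figure is only indicative):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    // Reads a file line by line with a plain BufferedReader and prints a rough MB/s figure.
    public class ReadSpeed {
        public static void main(String[] args) throws IOException {
            long start = System.nanoTime();
            long chars = 0;
            try (BufferedReader reader = new BufferedReader(new FileReader(args[0]))) {
                for (String line; (line = reader.readLine()) != null; ) {
                    chars += line.length() + 1; // +1 for the stripped line terminator
                }
            }
            double seconds = (System.nanoTime() - start) / 1e9;
            System.out.printf("%.1f MB/s%n", chars / 1e6 / seconds);
        }
    }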

Are frameworks an alternative?

I don't know about the performance, but

http://commons.apache.org/io/

http://commons.apache.org/io/api-release/index.html (see the IOUtils class)

Commons IO defines very easy-to-use helper classes for such cases.
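For instance, a small sketch using Commons IO's LineIterator, assuming a UTF-8 file (the method name is made up):

    import java.io.File;
    import java.io.IOException;

    import org.apache.commons.io.FileUtils;
    import org.apache.commons.io.LineIterator;

    // Streams the file line by line instead of loading it all into memory.
    static void processFile(File file) throws IOException {
        LineIterator it = FileUtils.lineIterator(file, "UTF-8");
        try {
            while (it.hasNext()) {
                String line = it.nextLine();
                // handle the line here
            }
        } finally {
            LineIterator.closeQuietly(it);
        }
    }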

I have a very simple loop that reads about 2000 lines (50 KB) from a file on the SD card using BufferedReader, and it reads them all in about 100 ms in debug mode on a Galaxy Tab 2. Not too bad. Then I put a Scanner in the loop and the time went through the roof (tens of seconds), plus lots of GC_CONCURRENT messages:

    Scanner scanner = new Scanner(line);   // a new Scanner allocated for every line
    int eventType = scanner.nextInt(16);   // parse the first token as a hex int

So at least in my case it's the Scanner that's the problem. I guess I need to scan the ints another way, but I have no idea why it could be so slow.
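One way to scan the ints another way, as a sketch, assuming the event type is the first whitespace-delimited token on each line and is written in hex (the helper name is made up):

    // Parses the leading hex token directly instead of allocating a Scanner per line.
    static int parseEventType(String line) {
        String token = line.trim().split("\\s+", 2)[0];
        return Integer.parseInt(token, 16);
    }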

According to this SO post, you might also want to give the Scanner class a shot.
