
Why is ReversedLinesFileReader so slow?

I have a file that is 21.6GB and I want to read it from the end to the start rather than from the beginning to the end as you would usually do.

If I read each line of the file from the start to the end using the following code, then it takes 1 minute, 12 seconds.

val startTime = System.currentTimeMillis()
File("very-large-file.xml").forEachLine {
    val i = 0
}
val diff = System.currentTimeMillis() - startTime
println(diff.timeFormat())

Now, I have read that to read a file in reverse I should use ReversedLinesFileReader from Apache Commons IO. I have created the following extension function to do just this:

fun File.forEachLineFromTheEndOfFile(action: (line: String) -> Unit) {
    val reader = ReversedLinesFileReader(this, Charset.defaultCharset())
    var line = reader.readLine()
    while (line != null) {
        action.invoke(line)
        line = reader.readLine()
    }

    reader.close()
}

and then call it in the following way, which is the same as the previous way except for the call to the forEachLineFromTheEndOfFile function:

val startTime = System.currentTimeMillis()
File("very-large-file.xml").forEachLineFromTheEndOfFile {
    val i = 0
}
val diff = System.currentTimeMillis() - startTime
println(diff.timeFormat())

This took 17 minutes and 50 seconds to run!

  • Am I using ReversedLinesFileReader in the correct way?
  • I am running Linux Mint with an Ext4 file system on an SSD. Could this have anything to do with it?
  • Is it just the case that files should not be read from the end to the start?

The correct way to investigate this problem would be:

  1. Write a version of this test in pure Java.
  2. Benchmark it to make sure that the performance problem is still there.
  3. Profile it to figure out where the performance bottleneck is.

Q: Am I using ReversedLinesFileReader in the correct way?

Yes. (Assuming that it is appropriate to use a line reader at all. That depends on what you are really trying to do. For instance, if you just wanted to count lines backwards, then you could read one character at a time and count the newline sequences.)
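
As a rough illustration of that aside, here is a minimal sketch (not from the answer) that counts lines with a plain buffered byte scan and no line assembly at all:

import java.io.File

// Hedged sketch: count '\n' bytes directly, with no per-line String assembly.
// For UTF-8 this is safe because the byte 0x0A never occurs inside a
// multi-byte sequence.
fun countLines(file: File): Long {
    var count = 0L
    val newline = '\n'.code.toByte()
    file.inputStream().use { input ->
        val buf = ByteArray(4096 * 32)
        while (true) {
            val n = input.read(buf)
            if (n < 0) break
            for (i in 0 until n) {
                if (buf[i] == newline) count++
            }
        }
    }
    return count
}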

Q: I am running Linux Mint with an Ext4 file system on an SSD. Could this have anything to do with it?

Possibly. Reading a file in reverse means that the read-ahead strategies the OS uses to provide fast I/O may not work. It could also be interacting with the characteristics of an SSD.

Q: Is it just the case that files should not be read from the end to the start?

Possibly. See above.


The other thing that you have not considered is that your file could actually contain some extremely long lines. The bottleneck could be the assembly of the characters into (long) lines.

Looking at the source code, it would seem that there is potential for O(N^2) behavior when lines are very long. The critical part is (I think) the way that "rollover" is handled by FilePart. Note the way that the "left over" data gets copied.
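
To see why that copy pattern can go quadratic, consider this illustrative sketch (not the library's actual code): if a single line spans k blocks read backwards, and each new block is prepended to the accumulated "left over" bytes, everything read so far is re-copied at every block boundary.

// Illustrative only (not Commons IO's code). Prepending each block copies the
// whole accumulation again, so a line spanning k blocks costs on the order of
// 1 + 2 + ... + k block copies, i.e. O(k^2).
fun assembleBackwards(blocks: List<ByteArray>): ByteArray {
    var leftOver = ByteArray(0)
    for (block in blocks) {
        leftOver = block + leftOver // allocates and copies block plus leftOver
    }
    return leftOver
}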

You are asking for a very expensive operation. Not only are you using random access in blocks to read the file backwards (so if the file system is reading ahead, it is reading in the wrong direction), you are also reading an XML file, which is UTF-8, and that encoding is slower to decode than a fixed-byte encoding.

Then on top of that, you are using a less-than-efficient algorithm. It reads a block at a time, of an inconvenient size (is it disk-block-size aware? are you setting the block size to match your file system?), backwards while processing the encoding, and it makes an (unnecessary?) copy of the partial byte array before turning it into a string (do you need a string to parse?). It could create the string without the copy; and really, creating the string could probably be deferred so that you work directly from the buffer, only decoding if you need to (XML parsers, for example, also work from ByteArrays or buffers). And there are other array copies that just are not needed but are more convenient for the code.

It also might have a bug, in that it checks for newlines without considering that the character might mean something different if it is actually part of a multi-byte sequence. It would have to look back a few extra characters to check this for variable-length encodings, and I don't see it doing that.

So instead of a nice, forward-only, heavily buffered sequential read of the file, which is the fastest thing you can do on your file system, you are doing random reads of one block at a time. It should at least read multiple disk blocks so that it can use the forward momentum (setting the block size to some multiple of your disk block size will help) and also reduce the number of "left over" copies made at buffer boundaries.
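
ReversedLinesFileReader does let you choose the block size via a constructor overload, so one cheap experiment is a variant of the question's extension function that passes a larger multiple of an (assumed 4 KiB) file-system block:

import org.apache.commons.io.input.ReversedLinesFileReader
import java.io.File
import java.nio.charset.Charset

// Sketch: same loop as the question's extension function, but with an explicit
// block size (here 128 KiB, assuming 4 KiB file-system blocks) passed to the
// ReversedLinesFileReader(File, blockSize, Charset) overload.
fun File.forEachLineFromTheEndOfFile(blockSize: Int, action: (line: String) -> Unit) {
    ReversedLinesFileReader(this, blockSize, Charset.defaultCharset()).use { reader ->
        var line = reader.readLine()
        while (line != null) {
            action(line)
            line = reader.readLine()
        }
    }
}

// usage: File("very-large-file.xml").forEachLineFromTheEndOfFile(4096 * 32) { /* ... */ }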

There are probably faster approaches. But none of them will be as fast as reading the file in forward order.
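
One such approach, sketched here under the assumption that only newline positions matter (single-byte '\n' terminators, which holds for UTF-8), is to seek backwards through the file in large chunks with RandomAccessFile and scan the raw bytes:

import java.io.RandomAccessFile

// Hedged sketch: scan the file backwards in large fixed-size chunks, counting
// newline bytes without ever assembling lines or decoding characters.
fun countLinesBackwards(path: String, chunkSize: Int = 4096 * 32): Long {
    val newline = '\n'.code.toByte()
    var count = 0L
    RandomAccessFile(path, "r").use { raf ->
        val buf = ByteArray(chunkSize)
        var pos = raf.length()
        while (pos > 0) {
            val start = maxOf(0L, pos - chunkSize)
            val len = (pos - start).toInt()
            raf.seek(start)
            raf.readFully(buf, 0, len)
            for (i in len - 1 downTo 0) {
                if (buf[i] == newline) count++
            }
            pos = start
        }
    }
    return count
}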

UPDATE:

Ok, so I tried an experiment with a rather silly version that processes around 27G of data, by reading the first 10 million lines from a Wikidata JSON dump and then reversing those lines.

Timings on my 2015 MacBook Pro (with all my dev stuff and many Chrome windows open, eating memory and some CPU all the time; about 5G of total memory free; VM size at the default with no parameters set; not run under a debugger):

reading in reverse order: 244,648 ms = 244 secs = 4 min 4 secs
reading in forward order:  77,564 ms =  77 secs = 1 min 17 secs

temp file count:   201
approx char count: 29,483,478,770 (line content not including line endings)
total line count:  10,050,000

The algorithm is to read the original file by lines, buffering 50,000 lines at a time and writing each buffer, in reverse order, to a numbered temp file. Then, after all the files are written, they are read in reverse numerical order, forward by lines. Basically, this divides the original into fragments in reverse sort order. It could be optimized, because this is the most naive version of the algorithm with no tuning. But it does do what file systems do best: sequential reads and sequential writes with good-sized buffers.

So this is a lot faster than the one you were using, and it could be tuned from here to be more efficient. You could trade CPU for disk I/O size and try using gzipped files as well, maybe with a two-threaded model to gzip the next buffer while processing the previous one. Fewer string allocations, checking each file function to make sure nothing extra is going on, making sure there is no double buffering, and more.
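
As a sketch of the gzip idea (the helper names here are illustrative, not part of the code below), each temp fragment could be written through a GZIPOutputStream and read back through a GZIPInputStream, trading CPU for disk I/O:

import java.io.File
import java.util.zip.GZIPInputStream
import java.util.zip.GZIPOutputStream

// Hedged sketch of the gzip variant: write a buffered fragment reversed and
// compressed, then stream it back line by line.
fun writeReversedGzipFragment(file: File, lines: List<String>) {
    GZIPOutputStream(file.outputStream()).bufferedWriter().use { writer ->
        for (line in lines.asReversed()) {
            writer.write(line)
            writer.newLine()
        }
    }
}

fun readGzipFragment(file: File, action: (String) -> Unit) {
    GZIPInputStream(file.inputStream()).bufferedReader().use { reader ->
        reader.lineSequence().forEach(action)
    }
}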

The ugly but functional code is:

package com.stackoverflow.reversefile

import java.io.File
import java.util.*

fun main(args: Array<String>) {
    val maxBufferSize = 50000
    val lineBuffer = ArrayList<String>(maxBufferSize)
    val tempFiles = ArrayList<File>()
    val originalFile = File("/data/wikidata/20150629.json")
    val tempFilePrefix = "/data/wikidata/temp/temp"
    val maxLines = 10000000

    var approxCharCount: Long = 0
    var tempFileCount = 0
    var lineCount = 0

    val startTime = System.currentTimeMillis()

    println("Writing reversed partial files...")

    try {
        fun flush() {
            val bufferSize = lineBuffer.size
            if (bufferSize > 0) {
                lineCount += bufferSize
                tempFileCount++
                File("$tempFilePrefix-$tempFileCount").apply {
                    bufferedWriter().use { writer ->
                        ((bufferSize - 1) downTo 0).forEach { idx ->
                            writer.write(lineBuffer[idx])
                            writer.newLine()
                        }
                    }
                    tempFiles.add(this)
                }
                lineBuffer.clear()
            }

            println("  flushed at $lineCount lines")
        }

        // read forward and break into backward-sorted chunks
        originalFile.bufferedReader(bufferSize = 4096 * 32)
                .lineSequence()
                .takeWhile { lineCount <= maxLines }.forEach { line ->
                    lineBuffer.add(line)
                    if (lineBuffer.size >= maxBufferSize) flush()
                }
        flush()

        // read the backward-sorted chunks back, last file first
        println("Reading reversed lines ...")
        tempFiles.reversed().forEach { tempFile ->
            tempFile.bufferedReader(bufferSize = 4096 * 32).lineSequence()
                .forEach { line ->
                    approxCharCount += line.length
                    // a line has been read here
                }
            println("  file $tempFile current char total $approxCharCount")
        }
    } finally {
        tempFiles.forEach { it.delete() }
    }

    val elapsed = System.currentTimeMillis() - startTime

    println("temp file count:   $tempFileCount")
    println("approx char count: $approxCharCount")
    println("total line count:  $lineCount")
    println()
    println("Elapsed:  ${elapsed}ms  ${elapsed / 1000}secs  ${elapsed / 1000 / 60}min  ")

    println("reading original file again:")
    val againStartTime = System.currentTimeMillis()
    var againLineCount = 0
    originalFile.bufferedReader(bufferSize = 4096 * 32)
            .lineSequence()
            .takeWhile { againLineCount <= maxLines }
            .forEach { againLineCount++ }
    val againElapsed = System.currentTimeMillis() - againStartTime
    println("Elapsed:  ${againElapsed}ms  ${againElapsed / 1000}secs  ${againElapsed / 1000 / 60}min  ")
}
