
Reading file takes too long

My application starts by parsing a ~100 MB file from the SD card and takes minutes to do so. To put that in perspective, on my PC, parsing the same file takes seconds.

I started by naively implementing the parser using Matcher and Pattern, but DDMS told me that 90% of the time was spent computing regular expressions. And it took more than half an hour to parse the file. The pattern is ridiculously simple; a line consists of:

ID (a number) <TAB> LANG (a 3-to-5 character string) <TAB> DATA (the rest)

I decided to try and use String.split. It didn't show significant improvement, probably because this function may use regular expressions itself. At that point I decided to rewrite the parser entirely, and ended up with something like this:

protected Collection<Sentence> doInBackground( Void... params ) {
    try ( BufferedReader reader = new BufferedReader( new FileReader( sentenceFile ) ) ) {
        String currentLine;
        while ( (currentLine = reader.readLine()) != null ) {
            treatLine( currentLine, allSentences );
        }
    } catch ( IOException e ) {
        // log and bail out; doInBackground cannot throw checked exceptions
    }
    return allSentences;
}

private void treatLine( String line, Collection<Sentence> allSentences ) {
    char[] str = line.toCharArray();

    // ...
    // treat the array of chars into an id, a language and some data

    allSentences.add( new Sentence( id, lang, data ) );
}
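
The elided parsing can be sketched without any regex at all: find the two tabs with indexOf and cut out substrings. This is only a guess at the missing body (the real Sentence class and its field handling may differ):

```java
import java.util.Collection;

// Hypothetical sketch of the elided parsing: locate the two tab separators
// with indexOf/substring instead of a regex.
class LineParser {
    static final class Sentence {
        final int id; final String lang; final String data;
        Sentence( int id, String lang, String data ) {
            this.id = id; this.lang = lang; this.data = data;
        }
    }

    static void treatLine( String line, Collection<Sentence> allSentences ) {
        int firstTab  = line.indexOf( '\t' );
        int secondTab = line.indexOf( '\t', firstTab + 1 );

        int id      = Integer.parseInt( line.substring( 0, firstTab ) );
        String lang = line.substring( firstTab + 1, secondTab );
        String data = line.substring( secondTab + 1 );

        allSentences.add( new Sentence( id, lang, data ) );
    }
}
```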

And I noticed a huge boost. It took minutes instead of half an hour. But I wasn't satisfied with this, so I profiled and realized that a bottleneck was BufferedReader.readLine. I wondered: it could be IO-bound, but it could also be that a lot of time is spent filling up an intermediary buffer I don't really need. So I rewrote the whole thing using FileReader directly:

protected Collection<Sentence> doInBackground( Void... params ) {
    FileReader reader = new FileReader( sentenceFile );
    int currentChar;
    while ( (currentChar = reader.read()) != -1 ) {
        // parse an id
        // ...            

        // parse a language
        while ( (currentChar = reader.read()) != -1 ) {
            // do some parsing stuff
        }

        // parse the sentence data
        while ( (currentChar = reader.read()) != -1 ) {
            // parse parse parse
        }

        allSentences.add( new Sentence( id, lang, data ) );
    }

    reader.close();
    return allSentences;
}

And I was quite surprised to realize that the performance was super bad. Most of the time is spent in FileReader.read, obviously. I guess reading just one char at a time costs a lot.

Now I am a bit out of inspiration. Any tips?

Another option that might improve performance is to use an InputStreamReader around a FileInputStream. You'll have to do the buffering yourself, but that may well increase performance. See this tutorial for more information - but do not follow it blindly. For instance, since you're already working with char arrays, you can use a char array as the buffer (and send the content to treatLine() whenever you reach a newline).
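
A minimal sketch of that idea: read big chunks into a char[] and cut out lines by hand. The method takes a plain Reader so that, over a file, you could pass something like new InputStreamReader(new FileInputStream(sentenceFile), "UTF-8") - the encoding is an assumption:

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

// Do the buffering yourself: pull 8 KB chunks and split them into lines.
class ChunkedLineReader {
    static List<String> readLines( Reader reader ) throws IOException {
        List<String> lines = new ArrayList<>();
        char[] buf = new char[8192];
        StringBuilder current = new StringBuilder();
        int n;
        while ( (n = reader.read( buf )) != -1 ) {
            for ( int i = 0; i < n; i++ ) {
                char c = buf[i];
                if ( c == '\n' ) {
                    lines.add( current.toString() );   // one complete line
                    current.setLength( 0 );
                } else if ( c != '\r' ) {              // tolerate CRLF endings
                    current.append( c );
                }
            }
        }
        if ( current.length() > 0 ) {
            lines.add( current.toString() );           // last line without '\n'
        }
        return lines;
    }
}
```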

Yet another suggestion is to use Thread directly. The documentation on AsyncTask says (my emphasis):

AsyncTask is designed to be a helper class around Thread and Handler and does not constitute a generic threading framework. AsyncTasks should ideally be used for short operations (a few seconds at the most). If you need to keep threads running for long periods of time, it is highly recommended you use the various APIs provided by the java.util.concurrent package such as Executor, ThreadPoolExecutor and FutureTask.
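
A rough sketch of what that looks like with an ExecutorService; parseFile() here is a stand-in for the actual parsing work, not the asker's method:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

// Run the long parse on a plain executor instead of an AsyncTask.
class ParseRunner {
    static Future<Integer> startParse( ExecutorService executor ) {
        // submit(Callable) returns a Future; get() blocks until done
        return executor.submit( () -> parseFile() );
    }

    // placeholder for the real file-parsing work
    static int parseFile() {
        return 0; // would return the number of sentences parsed
    }
}
```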

Also, getting a faster SD card will certainly help - this is probably the main reason it is much slower than on a desktop. A normal HD can read maybe 60 MB/s, while a slow SD card reads 2 MB/s.

I guess you need to keep the BufferedReader but avoid readLine. FileReader reads from the SD card, which is slowest. BufferedReader reads from memory, which is better. Your second approach increases the number of calls to FileReader.read(), so I guess that will not work.

If readLine() is time consuming, try something like:

   reader.read(char[] cbuf, int off, int len) 

Try to get a large chunk of data at a time.
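
For instance, a chunked read loop along those lines (here just counting characters as they arrive) could look like this sketch:

```java
import java.io.IOException;
import java.io.Reader;

// Minimal use of read(char[], int, int): pull large chunks instead of
// single chars.
class BulkRead {
    static long countChars( Reader reader ) throws IOException {
        char[] cbuf = new char[8192];
        long total = 0;
        int n;
        while ( (n = reader.read( cbuf, 0, cbuf.length )) != -1 ) {
            total += n;   // n chars arrived in this chunk
        }
        return total;
    }
}
```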

Removing the BufferedReader made it worse. Of course. You do need the 'filling up an intermediary buffer'. It saves you 8191 out of every 8192 system calls that you are doing per char with the FileReader directly. Buffered I/O is always faster. I don't know why you would ever have thought otherwise.

As @EJP has mentioned, you should use BufferedReader. But more fundamentally, you are running on a mobile device; it's not a PC. The flash read speed is nowhere near that of a PC, the computing power is a fraction of a 4-core 8-thread i7 running at 3.5 GHz, and we haven't even considered what running both the flash and the CPU at full speed would do to the device's battery life.

So the real question you should ask yourself is: why does your app need to parse 100 MB of data? And if it needs to be parsed every time it starts up, why can't you just parse it once on a PC so your users don't have to?
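
One way to sketch that idea: pre-parse on the PC and ship a compact binary file that the device reads back record by record with DataInputStream, skipping all text parsing at startup. The record layout below is an assumption, not an established format:

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Write each parsed sentence as a fixed binary record on the PC side,
// then read it back on the device with no per-line text parsing.
class BinaryFormat {
    static void writeRecord( DataOutputStream out, int id, String lang, String data )
            throws IOException {
        out.writeInt( id );
        out.writeUTF( lang );
        out.writeUTF( data );
    }

    static String readRecord( DataInputStream in ) throws IOException {
        int id      = in.readInt();
        String lang = in.readUTF();
        String data = in.readUTF();
        return id + "\t" + lang + "\t" + data;   // re-joined for illustration
    }
}
```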

Is allSentences an ArrayList? If so, it may hold a lot of items and have to be resized many times. Try to initialize the list with a large capacity.

Each ArrayList instance has a capacity. The capacity is the size of the array used to store the elements in the list. It is always at least as large as the list size. As elements are added to an ArrayList, its capacity grows automatically. The details of the growth policy are not specified beyond the fact that adding an element has constant amortized time cost.

An application can increase the capacity of an ArrayList instance before adding a large number of elements using the ensureCapacity operation. This may reduce the amount of incremental reallocation. (from the ArrayList documentation)
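
Applied to this case, presizing might look like the following sketch; the expected line count is a made-up estimate, not something taken from the question:

```java
import java.util.ArrayList;

// Allocate the backing array once, instead of letting it be copied on
// every growth step while millions of lines are added.
class PresizedList {
    static ArrayList<String> withCapacity( int expectedLines ) {
        ArrayList<String> allSentences = new ArrayList<>( expectedLines );
        // equivalently, on an already-existing list:
        allSentences.ensureCapacity( expectedLines );
        return allSentences;
    }
}
```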

Other things you can try:

  • Use the NDK.
  • as @Anson Yao said, try to increase the size of the buffer
  • remove the treatLine function, to decrease the overhead of calling a function

About file reading

Top to bottom, reading a character looks like this:

  1. in Java you request to read a character;
  2. this translates to reading a byte (usually, depending on the encoding) from the InputStream;
  3. this goes to native code, where it is translated to an analogous operating system command to read one byte from the open file;
  4. then this one byte travels the same way back.

And when you read into a buffer, the same sequence of events happens, but many thousands more bytes are transferred in one pass.

From this you can certainly build an intuition for why reading one char at a time from a file is very slow.
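
That intuition can be made concrete with a small counting wrapper: with a BufferedReader on top, thousands of single-char reads are served by only one or two underlying bulk reads of the source:

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

// Counts how often the underlying bulk read(char[], int, int) is hit.
// Wrapping it in a BufferedReader shows one underlying call serving
// thousands of per-char reads.
class CountingReader extends FilterReader {
    int calls = 0;

    CountingReader( Reader in ) {
        super( in );
    }

    @Override
    public int read( char[] cbuf, int off, int len ) throws IOException {
        calls++;
        return super.read( cbuf, off, len );
    }
}
```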

About regular expressions

I can't see anything wrong with the Pattern and Matcher approach: if the expression is written right, and the Pattern is compiled only once and reused, it should be very fast.

String#split, as you suspect, also uses a regex, and recompiles it every time you call it.
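
A sketch of the compile-once approach; the tab pattern below is an assumption about what the split might use, not the asker's actual expression:

```java
import java.util.regex.Pattern;

// Compile the pattern a single time and reuse it for every line, rather
// than paying the compilation cost on each String#split call.
class SplitOnce {
    static final Pattern TAB_SPLIT = Pattern.compile( "\t" );

    static String[] split( String line ) {
        return TAB_SPLIT.split( line );   // no recompilation per call
    }
}
```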
