
Reading file takes too long

My application starts by parsing a ~100MB file from the SD card and takes minutes to do so. To put that in perspective, on my PC, parsing the same file takes seconds.

I started by naively implementing the parser using Matcher and Pattern, but DDMS told me that 90% of the time was spent evaluating regular expressions, and parsing the whole file took more than half an hour. The pattern is ridiculously simple; a line consists of:

ID (a number) <TAB> LANG (a 3-to-5 character string) <TAB> DATA (the rest)

I then tried String.split. It didn't show significant improvement, probably because that method uses regular expressions itself. At that point I decided to rewrite the parser entirely, and ended up with something like this:

protected Collection<Sentence> doInBackground( Void... params ) {
    Collection<Sentence> allSentences = new ArrayList<Sentence>();
    try {
        BufferedReader reader = new BufferedReader( new FileReader( sentenceFile ) );

        String currentLine;
        while ( (currentLine = reader.readLine()) != null ) {
            treatLine( currentLine, allSentences );
        }

        reader.close();
    } catch ( IOException e ) {
        // error handling elided
    }
    return allSentences;
}

private void treatLine( String line, Collection<Sentence> allSentences ) {
    char[] str = line.toCharArray();

    // ...
    // split the char array into an id, a language and the data

    allSentences.add( new Sentence( id, lang, data ) );
}

And I noticed a huge boost: it took minutes instead of half an hour. But I still wasn't satisfied, so I profiled again and found that the bottleneck was BufferedReader.readLine. I wondered: it could be IO-bound, but it could also be that a lot of time was spent filling up an intermediary buffer I don't really need. So I rewrote the whole thing using FileReader directly:

protected Collection<Sentence> doInBackground( Void... params ) {
    Collection<Sentence> allSentences = new ArrayList<Sentence>();
    FileReader reader = new FileReader( sentenceFile );
    int currentChar;
    while ( (currentChar = reader.read()) != -1 ) {
        // parse an id
        // ...

        // parse a language
        while ( (currentChar = reader.read()) != -1 ) {
            // do some parsing stuff
        }

        // parse the sentence data
        while ( (currentChar = reader.read()) != -1 ) {
            // parse parse parse
        }

        allSentences.add( new Sentence( id, lang, data ) );
    }

    reader.close();
    return allSentences;
}

And I was quite surprised to find that the performance was terrible. Most of the time is spent in FileReader.read, obviously. I guess reading just one char at a time costs a lot.

Now I am a bit out of ideas. Any tips?

Another option that might improve performance is to wrap a FileInputStream in an InputStreamReader. You'll have to do the buffering yourself, but that may well increase performance. See this tutorial for more information, but don't follow it blindly: since you're already working with char arrays, you can use a char array as the buffer (and hand a completed line to treatLine() when you reach a newline).
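A minimal sketch of what that could look like, assuming UTF-8 content and reusing treatLine() from the question (the 64 KB buffer size is an arbitrary choice):

private Collection<Sentence> parseWithOwnBuffer( File sentenceFile ) throws IOException {
    Collection<Sentence> allSentences = new ArrayList<Sentence>();
    InputStreamReader in = new InputStreamReader( new FileInputStream( sentenceFile ), "UTF-8" );
    char[] buffer = new char[ 64 * 1024 ];   // one read() call fills the whole array
    StringBuilder line = new StringBuilder();
    int n;
    while ( (n = in.read( buffer, 0, buffer.length )) != -1 ) {
        for ( int i = 0; i < n; i++ ) {
            if ( buffer[ i ] == '\n' ) {
                treatLine( line.toString(), allSentences );   // complete line gathered
                line.setLength( 0 );
            } else if ( buffer[ i ] != '\r' ) {               // tolerate Windows line endings
                line.append( buffer[ i ] );
            }
        }
    }
    if ( line.length() > 0 ) {
        treatLine( line.toString(), allSentences );           // file may not end with '\n'
    }
    in.close();
    return allSentences;
}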

Yet another suggestion is to use a Thread directly. The documentation on AsyncTask says (my emphasis):

AsyncTask is designed to be a helper class around Thread and Handler and does not constitute a generic threading framework. AsyncTasks should ideally be used for short operations (a few seconds at the most.) If you need to keep threads running for long periods of time, it is highly recommended you use the various APIs provided by the java.util.concurrent package such as Executor, ThreadPoolExecutor and FutureTask.
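For example, a minimal sketch of that advice using a single-thread Executor (parseFile() is a hypothetical stand-in for whatever parsing routine you end up with):

ExecutorService executor = Executors.newSingleThreadExecutor();
Future<Collection<Sentence>> pending = executor.submit( new Callable<Collection<Sentence>>() {
    public Collection<Sentence> call() throws Exception {
        return parseFile( sentenceFile );   // hypothetical parsing routine
    }
} );
// later, from code that is allowed to block (get() waits for the parse to finish):
Collection<Sentence> allSentences = pending.get();
executor.shutdown();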

Also, getting a faster SD card will certainly help - this is probably the main reason it's so much slower than on a desktop: a normal hard disk can read maybe 60 MB/s, while a slow SD card manages 2 MB/s.

I suggest you keep the BufferedReader but avoid readLine(). FileReader reads from the SD card, which is the slowest path; BufferedReader reads from memory, which is much faster. Your second approach increases the number of calls to FileReader.read(), so I guess that will not work.

If readLine() is time-consuming, try something like:

   reader.read(char[] cbuf, int off, int len) 

Try to get a large chunk of data at a time.
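For instance (a sketch; the 32 KB chunk size is an arbitrary pick):

FileReader reader = new FileReader( sentenceFile );
char[] cbuf = new char[ 32 * 1024 ];
int n;
while ( (n = reader.read( cbuf, 0, cbuf.length )) != -1 ) {
    // scan cbuf[0..n) for tabs and newlines and build Sentence objects here
}
reader.close();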

Removing the BufferedReader made it worse, of course. You do need that 'intermediary buffer' filling up: it saves you 8191 out of every 8192 system calls that you are making per char with the FileReader directly. Buffered I/O is always faster. I don't know why you would ever have thought otherwise.

As @EJP has mentioned, you should use a BufferedReader. But more fundamentally, you are running on a mobile device, not a PC. The flash read speed is nowhere near that of a PC, the computing power is a fraction of a 4-core, 8-thread i7 running at 3.5 GHz, and we haven't even considered what running both the flash and the CPU at full speed would do to the device's battery life.

So the real question you should ask yourself is: why does your app need to parse 100 MB of data? And if it needs to be parsed every time it starts up, why can't you parse it once on a PC, so your users don't have to?
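One way to act on that, sketched under the assumption that you preprocess the file on a desktop into a simple binary format (an int id, then lang and data written with writeUTF(), repeated; preparsedFile is a hypothetical name):

DataInputStream in = new DataInputStream(
        new BufferedInputStream( new FileInputStream( preparsedFile ) ) );
try {
    while ( true ) {
        int id = in.readInt();
        String lang = in.readUTF();
        String data = in.readUTF();
        allSentences.add( new Sentence( id, lang, data ) );
    }
} catch ( EOFException endOfFile ) {
    // readInt() signals the end of the stream this way
}
in.close();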

Is allSentences an ArrayList? If so, it may hold a lot of items and have to be resized many times. Try initializing the list with a large capacity.

Each ArrayList instance has a capacity. The capacity is the size of the array used to store the elements in the list. It is always at least as large as the list size. As elements are added to an ArrayList, its capacity grows automatically. The details of the growth policy are not specified beyond the fact that adding an element has constant amortized time cost.

An application can increase the capacity of an ArrayList instance before adding a large number of elements using the ensureCapacity operation. This may reduce the amount of incremental reallocation. (From the ArrayList documentation.)
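For example (a sketch; 200000 is just a guessed line count for a ~100 MB file, not a measured figure):

ArrayList<Sentence> allSentences = new ArrayList<Sentence>( 200000 );   // presized once
// or, before a known bulk insert on an existing list:
allSentences.ensureCapacity( 200000 );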

Other things you can try:

  • Use the NDK.
  • As @Anson Yao said, try increasing the size of the buffer (see the sketch after this list).
  • Remove the treatLine function to avoid the overhead of a method call per line.
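For the buffer-size point, a one-line sketch: BufferedReader's two-argument constructor sets its internal buffer size (the default is 8192 chars; 128 KB here is an arbitrary choice):

BufferedReader reader = new BufferedReader( new FileReader( sentenceFile ), 128 * 1024 );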

About file reading

Top-to-bottom, reading a character looks like this:

  1. in Java you request to read a character;
  2. it translates to reading a byte (usually, depending on the encoding) from the InputStream;
  3. this goes to native code, where it is translated to an analogous operating system command to read one byte from the open file;
  4. then this one byte travels the same way back.

And when you read into a buffer, the same sequence of events happens, but many thousands of bytes are transferred in one pass.

From this you can build an intuition for why reading one char at a time from a file is so slow.

About regular expressions

I can't see anything wrong with the Pattern and Matcher approach: if the expression is written right, and the Pattern is compiled only once and reused, it should be very fast.

String#split, as you suspect, also uses a regex, and recompiles it every time you call it.
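For reference, a sketch of the compile-once approach against the line format from the question (the exact character classes are assumptions about your data):

private static final Pattern LINE_PATTERN =
        Pattern.compile( "(\\d+)\\t([^\\t]{3,5})\\t(.*)" );

private void treatLine( String line, Collection<Sentence> allSentences ) {
    Matcher m = LINE_PATTERN.matcher( line );   // reuses the precompiled Pattern
    if ( m.matches() ) {
        allSentences.add( new Sentence( Integer.parseInt( m.group( 1 ) ),
                                        m.group( 2 ), m.group( 3 ) ) );
    }
}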
