My application starts by parsing a ~100MB file from the SD card and takes minutes to do so. To put that in perspective, on my PC, parsing the same file takes seconds.
I started by naively implementing the parser using Matcher and Pattern, but DDMS told me that 90% of the time was spent evaluating regular expressions, and it took more than half an hour to parse the file. The pattern is ridiculously simple; a line consists of:
ID (a number) <TAB> LANG (a 3-to-5 character string) <TAB> DATA (the rest)
I decided to try String.split. It didn't show significant improvement, probably because this function may use regular expressions itself. At that point I decided to rewrite the parser entirely, and ended up with something like this:
protected Collection<Sentence> doInBackground( Void... params ) {
    BufferedReader reader = new BufferedReader( new FileReader( sentenceFile ) );
    String currentLine = null;
    while ( (currentLine = reader.readLine()) != null ) {
        treatLine( currentLine, allSentences );
    }
    reader.close();
    return allSentences;
}

private void treatLine( String line, Collection<Sentence> allSentences ) {
    char[] str = line.toCharArray();
    // ...
    // treat the array of chars into an id, a language and some data
    allSentences.add( new Sentence( id, lang, data ) );
}
And I noticed a huge boost. It took minutes instead of half-an-hour. But I wasn't satisfied with this so I profiled and realized that a bottleneck was BufferedReader.readLine . I wondered: it could be IO-bound, but it also could be that a lot of time is taken filling up an intermediary buffer I don't really need. So I rewrote the whole thing using FileReader directly:
protected Collection<Sentence> doInBackground( Void... params ) {
    FileReader reader = new FileReader( sentenceFile );
    int currentChar;
    while ( (currentChar = reader.read()) != -1 ) {
        // parse an id
        // ...
        // parse a language
        while ( (currentChar = reader.read()) != -1 ) {
            // do some parsing stuff
        }
        // parse the sentence data
        while ( (currentChar = reader.read()) != -1 ) {
            // parse parse parse
        }
        allSentences.add( new Sentence( id, lang, data ) );
    }
    reader.close();
    return allSentences;
}
And I was quite surprised to realize that the performance was super bad. Most of the time is spent in FileReader.read , obviously. I guess reading just a char costs a lot.
Now I am a bit out of inspiration. Any tips?
Another option which might enhance performance is to use an InputStreamReader around a FileInputStream. You'll have to do the buffering yourself, but that may well increase performance. See this tutorial for more information, but do not follow it blindly. For instance, since you're already working with a char array, you can use a char array as the buffer (and send it to treatLine() when you've reached a newline).
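Here is a minimal sketch of that manual buffering, not a drop-in replacement. The names ManualBufferDemo, LineHandler and readLines are made up for illustration (LineHandler stands in for your treatLine()), and it assumes UTF-8 input with '\n' line endings. In the app you would pass a FileInputStream on sentenceFile instead of the in-memory stream used here.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ManualBufferDemo {
    /** Hypothetical callback standing in for treatLine(); receives a slice of the buffer. */
    interface LineHandler {
        void handle(char[] buf, int start, int length);
    }

    /** Reads chars in large chunks and hands each complete line to the handler. */
    static void readLines(InputStream in, int bufferSize, LineHandler handler) throws IOException {
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8); // assumed charset
        char[] buf = new char[bufferSize];
        int filled = 0;   // number of valid chars currently in buf
        int scanned = 0;  // chars at the front of buf already checked for '\n'
        int n;
        while ((n = reader.read(buf, filled, buf.length - filled)) != -1) {
            filled += n;
            int lineStart = 0;
            for (int i = scanned; i < filled; i++) {
                if (buf[i] == '\n') {
                    handler.handle(buf, lineStart, i - lineStart);
                    lineStart = i + 1;
                }
            }
            // Move the unfinished tail of the last line to the front of the buffer.
            filled -= lineStart;
            System.arraycopy(buf, lineStart, buf, 0, filled);
            scanned = filled;
            if (filled == buf.length) {
                // A single line is longer than the buffer: grow it so read() has room.
                buf = Arrays.copyOf(buf, buf.length * 2);
            }
        }
        if (filled > 0) {
            handler.handle(buf, 0, filled); // last line without a trailing newline
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = "1\teng\tHello\n2\tfra\tBonjour\n".getBytes(StandardCharsets.UTF_8);
        List<String> lines = new ArrayList<>();
        readLines(new ByteArrayInputStream(sample), 64 * 1024,
                (buf, start, length) -> lines.add(new String(buf, start, length)));
        System.out.println(lines.size() + " lines read");
    }
}
```

The handler gets a slice of the shared buffer rather than a new String, so no per-line toCharArray() copy is needed. Whether this beats BufferedReader.readLine() on your device is something you would have to profile.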
Yet another suggestion is to actually use Thread directly. The documentation on AsyncTask says (emphasis mine):
AsyncTask is designed to be a helper class around Thread and Handler and does not constitute a generic threading framework. AsyncTasks should ideally be used for short operations (a few seconds at the most.) If you need to keep threads running for long periods of time, it is highly recommended you use the various APIs provided by the java.util.concurrent package such as Executor, ThreadPoolExecutor and FutureTask.
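As a rough sketch of what the java.util.concurrent route could look like in plain Java: BackgroundParseDemo and parseInBackground are hypothetical names, and the trim() call is only a stand-in for the real parsing work. On Android you would not block on Future.get() from the UI thread; you would hand the result back through a Handler or similar.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class BackgroundParseDemo {
    /** Runs the (stand-in) parsing work on a dedicated worker thread. */
    static List<String> parseInBackground(final List<String> rawLines) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<List<String>> result = executor.submit(() -> {
                List<String> parsed = new ArrayList<>(rawLines.size());
                for (String line : rawLines) {
                    parsed.add(line.trim()); // placeholder for the real per-line parsing
                }
                return parsed;
            });
            // Blocks the calling thread until the worker is done.
            // Fine for a demo; never do this on the Android UI thread.
            return result.get();
        } finally {
            executor.shutdown(); // lets the worker thread exit once the task has finished
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parseInBackground(Arrays.asList(" 1\teng\tHello ", "2\tfra\tBonjour")));
    }
}
```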
Also, getting a faster SD card will certainly help - this is probably the main reason for it being much slower than on a desktop. A normal HD can read maybe 60 MB/s and a slow SD card 2 MB/s.
I guess you need to keep the BufferedReader, but you don't have to use readLine(). FileReader reads from the SD card, which is the slowest part. BufferedReader reads from memory, which is better. Your second approach increases the number of calls to FileReader.read(), so I don't think it will work.
If readLine() is time-consuming, try something like:
reader.read(char[] cbuf, int off, int len)
Try to get a large chunk of data at one time.
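To make that concrete, here is a sketch that pulls large chunks with read(char[], int, int) and splits the ID / LANG / DATA fields with a small state machine, so no line String is ever built. ChunkedParser and the String[] record are illustrative assumptions (your code would create Sentence objects), and it simply drops '\r' characters.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

final class ChunkedParser {
    /** Parses "ID<TAB>LANG<TAB>DATA" lines from the reader, chunkSize chars at a time. */
    static List<String[]> parse(Reader reader, int chunkSize) throws IOException {
        List<String[]> records = new ArrayList<>();
        char[] cbuf = new char[chunkSize];
        StringBuilder field = new StringBuilder();
        String id = null;
        String lang = null;
        int fieldIndex = 0; // 0 = reading ID, 1 = reading LANG, 2 = reading DATA
        int n;
        while ((n = reader.read(cbuf, 0, cbuf.length)) != -1) {
            for (int i = 0; i < n; i++) {
                char c = cbuf[i];
                if (c == '\t' && fieldIndex < 2) {
                    // Only the first two tabs separate fields; later tabs belong to DATA.
                    if (fieldIndex == 0) {
                        id = field.toString();
                    } else {
                        lang = field.toString();
                    }
                    field.setLength(0);
                    fieldIndex++;
                } else if (c == '\n') {
                    if (fieldIndex == 2) {
                        records.add(new String[] { id, lang, field.toString() });
                    } // lines with fewer than two tabs are skipped
                    field.setLength(0);
                    fieldIndex = 0;
                } else if (c != '\r') {
                    field.append(c);
                }
            }
        }
        if (fieldIndex == 2) {
            records.add(new String[] { id, lang, field.toString() }); // no trailing newline
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        String sample = "1\teng\tHello world\n2\tfra\tBonjour le monde\n";
        List<String[]> records = parse(new StringReader(sample), 64 * 1024);
        System.out.println(records.size() + " records parsed");
    }
}
```

Because the parser state lives outside the read loop, a record that straddles two chunks is handled the same way as one that fits inside a single chunk.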
Removing the BufferedReader made it worse. Of course. You do need the 'filling up an intermediary buffer'. It saves you 8191 out of 8192 of the system calls that you make, one per char, when using the FileReader directly. Buffered I/O is always faster. I don't know why you would ever have thought otherwise.
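You can see the effect without touching a file. The sketch below counts how often the underlying stream is asked for data, with and without a BufferedInputStream in front of it. ReadCallCounter and CountingInputStream are made-up names, and the count models calls into the wrapped stream, not real system calls on a device.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

final class ReadCallCounter {
    /** Wraps a stream and counts every read request that reaches it. */
    static final class CountingInputStream extends FilterInputStream {
        long calls;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            calls++;
            return super.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            calls++;
            return super.read(b, off, len);
        }
    }

    /** Reads `size` bytes one at a time and reports how many calls hit the source. */
    static long countCalls(int size, boolean buffered) throws IOException {
        CountingInputStream source = new CountingInputStream(new ByteArrayInputStream(new byte[size]));
        InputStream in = buffered ? new BufferedInputStream(source, 8192) : source;
        while (in.read() != -1) {
            // consume one byte per call, like the FileReader.read() loop in the question
        }
        return source.calls;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("unbuffered calls: " + countCalls(100_000, false));
        System.out.println("buffered calls:   " + countCalls(100_000, true));
    }
}
```

The calling code is identical in both cases, one byte per read(); only the buffered variant turns that into roughly one request per 8192 bytes at the source.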
As @EJP has mentioned, you should use BufferedReader. But more fundamentally, you are running on a mobile device, not a PC. The flash read speed is nowhere near that of a PC, the computing power is a fraction of a 4-core, 8-thread i7 running at 3.5 GHz, and we haven't even considered what running both the flash and the CPU at full speed would do to the device's battery life.
So the real question you should ask yourself is: why does your app need to parse 100 MB of data? And if it needs to be parsed every time the app starts up, why can't you parse it once on a PC so your users don't have to?
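One way to "parse it on a PC" is to convert the text file once into a simple binary file and ship that instead. The sketch below is only an illustration of the idea: BinaryCache and its record layout are made up, it assumes the IDs fit in an int, and writeUTF() limits each string to 65535 encoded bytes. Whether it is actually faster on your device would need measuring.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

final class BinaryCache {
    /** PC side: writes already-parsed {id, lang, data} records in a compact binary form. */
    static void write(List<String[]> records, OutputStream out) throws IOException {
        DataOutputStream data = new DataOutputStream(new BufferedOutputStream(out));
        data.writeInt(records.size());
        for (String[] r : records) {
            data.writeInt(Integer.parseInt(r[0])); // assumes the ID fits in an int
            data.writeUTF(r[1]);
            data.writeUTF(r[2]);                   // at most 65535 encoded bytes
        }
        data.flush();
    }

    /** Device side: reads the records back without any text parsing. */
    static List<String[]> read(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(new BufferedInputStream(in));
        int count = data.readInt();
        List<String[]> records = new ArrayList<>(count); // exact size is known up front
        for (int i = 0; i < count; i++) {
            int id = data.readInt();
            String lang = data.readUTF();
            String text = data.readUTF();
            records.add(new String[] { Integer.toString(id), lang, text });
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        List<String[]> records = new ArrayList<>();
        records.add(new String[] { "1", "eng", "Hello" });
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        write(records, bytes);
        System.out.println(read(new ByteArrayInputStream(bytes.toByteArray())).size() + " record(s) read back");
    }
}
```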
Is allSentences an ArrayList? If so, it may hold a lot of items and have to be resized many times. Try initializing the list with a large capacity.
Each ArrayList instance has a capacity. The capacity is the size of the array used to store the elements in the list. It is always at least as large as the list size. As elements are added to an ArrayList, its capacity grows automatically. The details of the growth policy are not specified beyond the fact that adding an element has constant amortized time cost.
An application can increase the capacity of an ArrayList instance before adding a large number of elements using the ensureCapacity operation. This may reduce the amount of incremental reallocation. (From the ArrayList documentation.)
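A small sketch of presizing the list from the file size. PresizedListDemo and the average line length are guesses for illustration; you would plug in sentenceFile.length() and a figure that matches your data.

```java
import java.util.ArrayList;
import java.util.List;

final class PresizedListDemo {
    /** Rough line-count estimate: file size divided by an assumed average line length. */
    static int estimateLineCount(long fileSizeBytes, int assumedAverageLineBytes) {
        long estimate = fileSizeBytes / assumedAverageLineBytes + 1;
        return (int) Math.min(estimate, Integer.MAX_VALUE - 8);
    }

    /** Creates a list whose backing array is allocated once, up front. */
    static <T> List<T> newPresizedList(long fileSizeBytes, int assumedAverageLineBytes) {
        return new ArrayList<>(estimateLineCount(fileSizeBytes, assumedAverageLineBytes));
    }

    public static void main(String[] args) {
        // 100 MB file, 50 bytes per line on average (an assumption, measure your own data)
        System.out.println("estimated lines: " + estimateLineCount(100L * 1024 * 1024, 50));

        // An existing ArrayList can be grown ahead of time as well.
        ArrayList<String> existing = new ArrayList<>();
        existing.ensureCapacity(1_000);
        existing.add("1\teng\tHello");
        System.out.println("size after add: " + existing.size());
    }
}
```

Keep in mind that the backing array itself costs memory, which matters on a device with a small heap, so an overly generous estimate can hurt too.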
Other things you can try:
Top-to-bottom, reading a character looks like this:
InputStream
And when you read into a buffer, the same sequence of events happens, but many thousands more bytes are transferred in one pass.
From this you can certainly build an intuition why it is very slow to read one char at a time from a file.
I can't see anything wrong with the Pattern and Matcher approach: if the expression is written right, and the Pattern is compiled only once and reused, it should be very fast.
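For reference, "compiled once and reused" could look like the sketch below. The regex is my guess at the line format from the question (digits, tab, 3 to 5 non-tab characters, tab, the rest), and RegexLineParser is a made-up name. The Matcher is reused through reset(), so one parser instance must stay on a single thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexLineParser {
    // Compiled exactly once for the whole program.
    private static final Pattern LINE = Pattern.compile("(\\d+)\\t([^\\t]{3,5})\\t(.*)");

    // One Matcher per parser instance, re-targeted for every line (not thread-safe).
    private final Matcher matcher = LINE.matcher("");

    /** Returns {id, lang, data}, or null if the line does not match the format. */
    String[] parse(String line) {
        matcher.reset(line);
        if (!matcher.matches()) {
            return null;
        }
        return new String[] { matcher.group(1), matcher.group(2), matcher.group(3) };
    }

    public static void main(String[] args) {
        RegexLineParser parser = new RegexLineParser();
        String[] fields = parser.parse("42\teng\tHello there");
        System.out.println(fields == null ? "no match" : fields.length + " fields");
    }
}
```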
String#split, as you suspect, also uses a regex, and recompiles it every time you call it.
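If you want to avoid regexes altogether for a fixed two-tab format, indexOf and substring are enough. A minimal sketch, with TabSplitter as a hypothetical helper name:

```java
final class TabSplitter {
    /** Splits "ID<TAB>LANG<TAB>DATA" on the first two tabs; returns null if a tab is missing. */
    static String[] splitLine(String line) {
        int firstTab = line.indexOf('\t');
        if (firstTab < 0) {
            return null;
        }
        int secondTab = line.indexOf('\t', firstTab + 1);
        if (secondTab < 0) {
            return null;
        }
        return new String[] {
            line.substring(0, firstTab),             // ID
            line.substring(firstTab + 1, secondTab), // LANG
            line.substring(secondTab + 1)            // DATA, may itself contain tabs
        };
    }

    public static void main(String[] args) {
        String[] fields = splitLine("42\teng\tHello there");
        System.out.println(fields == null ? "malformed line" : fields.length + " fields");
    }
}
```

Unlike split("\t"), this leaves any further tabs inside DATA untouched.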