Using RandomAccessFile along with BufferedReader to speed up file read

I have to:

  • Read a large text file line by line.
  • Note the file pointer position after every line read.
  • Stop reading if the running time exceeds 30 seconds.
  • Resume from the last noted file pointer in a new process.

What I am doing:

  1. Using RandomAccessFile.getFilePointer() to note the file pointer.
  2. Wrapping the RandomAccessFile in a BufferedReader to speed up reading, as per this answer.
  3. When the running time exceeds 30 seconds, I stop reading the file. I then restart the process with a new RandomAccessFile and use RandomAccessFile.seek() to move the file pointer to where I left off.

Problem:

Because I am reading through a BufferedReader wrapped around the RandomAccessFile, the file pointer seems to jump far ahead on a single call to BufferedReader.readLine(). However, if I use RandomAccessFile.readLine() directly, the file pointer advances properly, one line at a time.

Using BufferedReader as a wrapper:

    RandomAccessFile randomAccessFile = new RandomAccessFile("mybigfile.txt", "r");
    BufferedReader brRafReader = new BufferedReader(new FileReader(randomAccessFile.getFD()));
    String line;
    while ((line = brRafReader.readLine()) != null) {
        System.out.println(line + ", Position : " + randomAccessFile.getFilePointer());
    }

Output:

Line goes here, Position : 13040
Line goes here, Position : 13040
Line goes here, Position : 13040
Line goes here, Position : 13040

Using RandomAccessFile.readLine() directly:

    RandomAccessFile randomAccessFile = new RandomAccessFile("mybigfile.txt", "r");
    String line;
    while ((line = randomAccessFile.readLine()) != null) {
        System.out.println(line + ", Position : " + randomAccessFile.getFilePointer());
    }

Output (this is as expected: the file pointer advances with each call to readLine()):

Line goes here, Position : 11011
Line goes here, Position : 11089
Line goes here, Position : 12090
Line goes here, Position : 13040

Could anyone tell me what I am doing wrong here? Is there any way to speed up the reading process while using RandomAccessFile?

The reason for the observed behavior is that, as the name suggests, the BufferedReader is buffered. It reads a larger chunk of data at once (into a buffer) and returns only the relevant part of the buffer contents, namely the part up to the next line separator (\n). The file pointer of the underlying RandomAccessFile therefore advances by a whole chunk at a time, not line by line.

I think there are, broadly speaking, two possible approaches:

  1. Implement your own buffering logic.
  2. Use some ugly reflection hack to obtain the required buffer offset.

For 1., you would no longer use RandomAccessFile#readLine. Instead, you'd do your own buffering via

byte buffer[] = new byte[8192];
...
// In a loop:
int read = randomAccessFile.read(buffer);
// Figure out where a line break `\n` appears in the buffer,
// return the resulting lines, and take the position of the `\n`
// into account when storing the "file pointer"

As the vague comment indicates, this may be cumbersome and fiddly. You'd basically re-implement what the readLine method of the BufferedReader class does. And at this point, I don't even want to mention the headaches that different line separators or character sets could cause.
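The buffering logic described above could look roughly like the following. This is only a minimal sketch under stated assumptions: the file is UTF-8 (or another encoding in which the byte 0x0A always means \n), and lines end in \n or \r\n. The class OffsetLineReader and its methods are hypothetical names introduced for illustration; they are not part of the JDK.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: reads lines in 8 KB chunks and tracks the exact byte offset. */
final class OffsetLineReader implements AutoCloseable {
    private final RandomAccessFile file;
    private final byte[] buffer = new byte[8192];
    private int bufferLength = 0; // number of valid bytes in the buffer
    private int bufferIndex = 0;  // index of the next unread byte in the buffer
    private long position;        // file offset of the next unread byte

    OffsetLineReader(String fileName, long startOffset) throws IOException {
        file = new RandomAccessFile(fileName, "r");
        file.seek(startOffset);
        position = startOffset;
    }

    /** Returns the next line without its terminator, or null at end of file. */
    String readLine() throws IOException {
        ByteArrayOutputStream lineBytes = new ByteArrayOutputStream();
        boolean sawAnyByte = false;
        while (true) {
            if (bufferIndex == bufferLength) {
                // Buffer exhausted: fetch the next chunk with a single read call.
                bufferLength = file.read(buffer);
                bufferIndex = 0;
                if (bufferLength <= 0) {
                    bufferLength = 0;
                    break; // end of file
                }
            }
            byte b = buffer[bufferIndex++];
            position++; // count every consumed byte, including the separator
            sawAnyByte = true;
            if (b == '\n') {
                break;
            }
            lineBytes.write(b);
        }
        if (!sawAnyByte) {
            return null;
        }
        byte[] bytes = lineBytes.toByteArray();
        int length = bytes.length;
        if (length > 0 && bytes[length - 1] == '\r') {
            length--; // drop the \r of a \r\n separator
        }
        // Assumption: the file content is UTF-8.
        return new String(bytes, 0, length, StandardCharsets.UTF_8);
    }

    /** Offset of the first byte after the last line returned; safe to pass to seek(). */
    long getPosition() {
        return position;
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}
```

After each readLine() call, getPosition() gives the offset to store. A new process would pass that stored value as startOffset to continue with the next line. A lone \r as a line separator is not handled by this sketch.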

For 2., you could simply read the private field of the BufferedReader that stores the offset into its buffer (nextChar in the implementation used below). This is implemented in the example below. Of course, this is a somewhat crude solution, but it is mentioned and shown here as a simple alternative, depending on how "sustainable" the solution should be and how much effort you are willing to invest.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.RandomAccessFile;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

public class LargeFileRead {
    public static void main(String[] args) throws Exception {

        String fileName = "myBigFile.txt";

        long before = System.nanoTime();
        List<String> result = readBuffered(fileName);
        //List<String> result = readDefault(fileName);
        long after = System.nanoTime();
        double ms = (after - before) / 1e6;
        System.out.println("Reading took " + ms + "ms "
                + "for " + result.size() + " lines");
    }

    private static List<String> readBuffered(String fileName) throws Exception {
        List<String> lines = new ArrayList<String>();
        RandomAccessFile randomAccessFile = new RandomAccessFile(fileName, "r");
        BufferedReader brRafReader = new BufferedReader(
                new FileReader(randomAccessFile.getFD()));
        String line = null;
        long currentOffset = 0;
        long previousOffset = -1;
        while ((line = brRafReader.readLine()) != null) {
            long fileOffset = randomAccessFile.getFilePointer();
            if (fileOffset != previousOffset) {
                if (previousOffset != -1) {
                    currentOffset = previousOffset;
                }
                previousOffset = fileOffset;
            }
            int bufferOffset = getOffset(brRafReader);
            long realPosition = currentOffset + bufferOffset;
            System.out.println("Position : " + realPosition 
                    + " with FP " + randomAccessFile.getFilePointer()
                    + " and offset " + bufferOffset);
            lines.add(line);
        }
        return lines;
    }

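    // Reads the private nextChar field of BufferedReader, which is an
    // implementation detail. The value is an index into the character buffer,
    // so it equals a byte offset only for single-byte encodings. On Java 16
    // and later, setAccessible fails unless the JVM is started with
    // --add-opens java.base/java.io=ALL-UNNAMED.
    private static int getOffset(BufferedReader bufferedReader) throws Exception {
        Field field = BufferedReader.class.getDeclaredField("nextChar");
        int result = 0;
        try {
            field.setAccessible(true);
            result = (Integer) field.get(bufferedReader);
        } finally {
            field.setAccessible(false);
        }
        return result;
    }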

    private static List<String> readDefault(String fileName) throws Exception {
        List<String> lines = new ArrayList<String>();
        RandomAccessFile randomAccessFile = new RandomAccessFile(fileName, "r");
        String line = null;
        while ((line = randomAccessFile.readLine()) != null) {
            System.out.println("Position : " + randomAccessFile.getFilePointer());
            lines.add(line);
        }
        return lines;
    }
}

(Note: The offsets may still appear to be off by 1, but this is because the line separator is not taken into account in the position. This could be adjusted if necessary.)

NOTE: This is only a sketch. The RandomAccessFile objects should be closed properly when reading is finished, but that depends on how the reading is supposed to be interrupted when the time limit is exceeded, as described in the question.

BufferedReader reads a block of data from the file, 8 KB by default. The search for line breaks, in order to return the next line, is done in that buffer.

I guess this is why you see a huge increment in the physical file position.

RandomAccessFile does not use a buffer when reading the next line. It reads byte after byte. That's really slow.

How is the performance if you just use a BufferedReader and remember the line you need to continue from?
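The line-counting idea could be sketched as follows. This is an illustration, not a measured recommendation: the class LineCountResume, the method processFrom and its parameters are hypothetical names, the file is assumed to be UTF-8, and the handler stands in for whatever per-line work the question requires.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.function.Consumer;

/** Hypothetical helper: resumes by line number, not by file pointer. */
final class LineCountResume {

    /**
     * Skips the first linesAlreadyDone lines, then hands each further line to
     * the handler until the file ends or the time budget is used up.
     * Returns the number of lines consumed so far, counted from the file start.
     */
    static long processFrom(String fileName, long linesAlreadyDone,
                            long budgetMillis, Consumer<String> handler)
            throws IOException {
        long deadline = System.nanoTime() + budgetMillis * 1_000_000L;
        long lineNumber = 0;
        // Assumption: the file is UTF-8 encoded.
        try (BufferedReader reader = Files.newBufferedReader(
                Paths.get(fileName), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                lineNumber++;
                if (lineNumber <= linesAlreadyDone) {
                    continue; // handled in an earlier run, skip it
                }
                handler.accept(line);
                if (System.nanoTime() - deadline >= 0) {
                    break; // time budget used up
                }
            }
        }
        return lineNumber;
    }
}
```

A first run would call processFrom(fileName, 0, 30_000, handler) and store the returned count; the next process passes that count as linesAlreadyDone. The trade-off is that skipped lines are still read and decoded on every restart, so the cost of resuming grows with the position in the file, unlike a seek().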
