
Fastest way of writing the first 10 000 lines of data file to new file

I want the first ten thousand lines of a hyuuge (.csv) file.

The naive way of

1) creating a reader & writer

2) reading the original file line for line

3) writing the first ten thousand lines to a new file

can't be the fastest, can it?

This will be a common operation in my app so I'm slightly concerned about speed, but also just curious.

Thanks.

There are a few ways of doing fast I/O in Java, but without benchmarking your particular case it's difficult to give a concrete figure or recommendation. Here are a few approaches you can try benchmarking:

  • Buffered readers/writers, with varying buffer sizes
  • Reading the entire file into memory (if it fits), splitting it in memory, and writing the result in a single go
  • Using NIO file API for reading/writing files (look into Channels)
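As a baseline to benchmark the other options against, the buffered reader/writer approach might look like the sketch below. The class name `HeadLines`, the method `copy` and its parameters are made up for illustration; the buffer size is left as a parameter so different values can be timed. Note that `readLine()` strips the line terminator, so this version normalises CRLF input to LF.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not from the original post.
final class HeadLines {

    /** Copies at most maxLines lines from src to dst and returns how many were copied. */
    static int copy(Path src, Path dst, int maxLines, int bufferSize, Charset cs)
            throws IOException {
        try (BufferedReader in = new BufferedReader(
                     new InputStreamReader(Files.newInputStream(src), cs), bufferSize);
             BufferedWriter out = new BufferedWriter(
                     new OutputStreamWriter(Files.newOutputStream(dst), cs), bufferSize)) {
            int copied = 0;
            String line;
            // Stop as soon as enough lines are copied; the rest of the file is never read.
            while (copied < maxLines && (line = in.readLine()) != null) {
                out.write(line);
                out.write('\n'); // readLine() dropped the original terminator
                copied++;
            }
            return copied;
        }
    }
}
```

A call such as `HeadLines.copy(src, dst, 10_000, 1 << 16, StandardCharsets.UTF_8)` would copy the first ten thousand lines with 64 KiB buffers; whether a larger buffer helps is exactly the kind of thing to measure.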

If you only want to read/write 10,000 lines or so:

  • it will probably take longer to start up a new JVM than to read / write the file,
  • the read / write time should be a fraction of a second, even doing it the naive way, and
  • the overall speed up from a copying algorithm is unlikely to be worthwhile.

Having said that, you can do better than reading a line at a time using BufferedReader.readLine() or whatever.

  • Depending on the character encoding of the file, you will get better performance by doing byte-wise I/O with a BufferedInputStream and BufferedOutputStream with large buffer sizes. Just write a loop to read a byte, conditionally update the line counter and write the byte ... until you have copied the requisite number of lines. (This assumes that you can detect the CR and/or LF characters by examining individual bytes. That holds for ASCII-compatible encodings such as ASCII, ISO-8859-1 and UTF-8, but not for UTF-16 or UTF-32, where a 0x0A byte can be part of a different character.)

  • If you use NIO and ByteBuffers, you can further reduce the amount of in-memory copying, though the CR / LF counting logic will be more complicated.
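The byte-wise loop from the first bullet could be sketched roughly as follows. This assumes an ASCII-compatible encoding and lines terminated by LF or CRLF (a file using bare CR terminators would not be split correctly). `HeadBytes` and `copy` are illustrative names only.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not from the original post.
final class HeadBytes {

    /**
     * Copies bytes from src to dst until maxLines LF bytes have been written
     * or the input ends. Returns the number of LF-terminated lines copied.
     */
    static int copy(Path src, Path dst, int maxLines, int bufferSize) throws IOException {
        try (InputStream in = new BufferedInputStream(Files.newInputStream(src), bufferSize);
             OutputStream out = new BufferedOutputStream(Files.newOutputStream(dst), bufferSize)) {
            int lines = 0;
            int b;
            while (lines < maxLines && (b = in.read()) != -1) {
                out.write(b);      // bytes pass through untouched, so CRLF stays CRLF
                if (b == '\n') {   // assumption: every line ends in LF (0x0A)
                    lines++;
                }
            }
            return lines;
        }
    }
}
```

No decoding or String allocation happens here, which is where any gain over `readLine()` would come from. If the input ends without a trailing newline, the final partial line is still copied but not counted.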

But the first question you should ask is whether it is even worthwhile bothering to optimize this.

Are the lines all the same length? If so, you can use RandomAccessFile to read x bytes (the number of lines times the line length) and then write those bytes to a new file. It may be quite memory intensive if you read them all in one go, though. I suspect this would be quicker, but it's probably worth benchmarking. This solution only works for fixed-length lines.
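For what it's worth, the fixed-length idea might be sketched like this. It copies in 64 KiB chunks rather than one big read, to keep memory use flat. `HeadFixed`, `copy` and `recordLength` are made-up names, and `recordLength` is assumed to be the line length in bytes including the line terminator.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not from the original post.
final class HeadFixed {

    /**
     * Copies the first lines * recordLength bytes of src to dst (fewer if the
     * file is shorter) and returns the number of bytes copied.
     */
    static long copy(Path src, Path dst, long lines, int recordLength) throws IOException {
        try (RandomAccessFile in = new RandomAccessFile(src.toFile(), "r");
             OutputStream out = Files.newOutputStream(dst)) {
            long total = Math.min(lines * recordLength, in.length());
            long remaining = total;
            byte[] chunk = new byte[64 * 1024]; // fixed-size buffer instead of one huge array
            while (remaining > 0) {
                int n = in.read(chunk, 0, (int) Math.min(chunk.length, remaining));
                if (n < 0) {
                    break; // file shrank while reading
                }
                out.write(chunk, 0, n);
                remaining -= n;
            }
            return total - remaining;
        }
    }
}
```

Because the byte count is known up front, no newline scanning is needed at all, but a single line of a different length silently breaks the result.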
