
Reading huge file in Java

I read a huge file (almost 5 million lines). Each line contains a Date and a Request, and I must parse the Requests between two concrete **Date**s. I use a BufferedReader to read the file until the start Date and then start parsing the lines. Can I use Threads for parsing the lines, because it takes a lot of time?

A good way to parallelize a lot of small tasks is to wrap the processing of each task in a FutureTask and then pass each task to a ThreadPoolExecutor to run them. The executor should be initialized with the number of CPU cores your system has available.

When you call executor.execute(future), the future will be queued for background processing. To avoid creating and destroying too many threads, the ThreadPoolExecutor will only create as many threads as you specified and execute the futures one after another.

To retrieve the result of a future, call future.get(). If the future hasn't completed yet (or hasn't even been started), this method will block until it has completed. Other futures are still executed in the background while you wait.

Remember to call executor.shutdown() when you don't need the executor anymore, so that it terminates its background threads; otherwise it keeps them around until the keep-alive time has expired or the executor is garbage-collected.

tl;dr pseudocode:

 create executor
 for each line in file
     create new FutureTask which parses that line
     pass future task to executor
     add future task to a list
 for each entry in task list
     call entry.get() to retrieve result
 executor.shutdown()
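
In real Java this could look roughly like the following sketch (the Request type, the parseLine method and the file name are placeholders for your own types and parsing logic):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ParallelParser {

        // Placeholder for your real parsing logic.
        static Request parseLine(String line) {
            return new Request(line);
        }

        public static void main(String[] args) throws Exception {
            // Create an executor sized to the number of CPU cores.
            int cores = Runtime.getRuntime().availableProcessors();
            ExecutorService executor = Executors.newFixedThreadPool(cores);

            List<Future<Request>> tasks = new ArrayList<>();
            try (BufferedReader reader = new BufferedReader(new FileReader("requests.log"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    final String current = line;
                    // submit() wraps the callable in a FutureTask internally and queues it.
                    tasks.add(executor.submit(() -> parseLine(current)));
                }
            }

            // get() blocks until the corresponding task has completed.
            for (Future<Request> task : tasks) {
                Request request = task.get();
                // ... use the parsed request ...
            }

            executor.shutdown();
        }

        static class Request {
            final String raw;
            Request(String raw) { this.raw = raw; }
        }
    }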

It isn't entirely clear from your question, but it sounds like you are reparsing your 5 million-line file every time a client requests data. You certainly can solve the problem by throwing more threads and more CPU cores at it, but a better solution would be to improve the efficiency of your application by eliminating duplicate work.

If this is the case, you should redesign your application to avoid reparsing the entire file on every request. Ideally you should store the data in a database or in memory instead of processing the flat text file on every request. Then, on a request, look up the information in the database or the in-memory data structure.

If you cannot eliminate the 5 million-line file entirely, you can periodically recheck the large file for changes, skip/seek to the end of the last record that was parsed, then parse only new records and update the database or in-memory data structure. This can all optionally be done in a separate thread.
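
A minimal sketch of that incremental step, assuming the file is only ever appended to (the class and method names are illustrative, and the actual parsing is left as a placeholder):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Remember where the previous pass stopped and parse only what was appended
    // since then. This could be called periodically from a background thread.
    class IncrementalReader {
        private long lastPosition = 0;

        void pollForNewRecords(String path) throws IOException {
            try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
                if (file.length() <= lastPosition) {
                    return;                      // nothing new (or the file was truncated/rotated)
                }
                file.seek(lastPosition);         // skip everything already parsed
                String line;
                while ((line = file.readLine()) != null) {
                    // ... parse the line and update the database / in-memory structure ...
                }
                lastPosition = file.getFilePointer();
            }
        }
    }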

Firstly, 5 million lines of 1000 characters is only about 5 GB, which is not necessarily prohibitive for a JVM. If this is actually a critical use case with lots of hits then buying more memory is almost certainly the right thing to do.

Secondly, if that is not possible, most likely the right thing to do is to build an ordered Map keyed by date. Every date is then a key in the map and points to a list of the line numbers that contain the requests for that date, so you can go directly to the relevant line numbers.

Something of the form

    TreeMap<Date, List<Long>>

would do nicely (a TreeMap keeps its keys sorted, and the values are the line numbers). That should have a memory usage of the order of 5,000,000 × 32 / 8 bytes = 20 MB, which should be fine.
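
A rough sketch of building and querying such an index (extractDate is a hypothetical helper for whatever your line format is; all names are illustrative):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Date;
    import java.util.List;
    import java.util.TreeMap;

    class DateIndex {
        // Ordered index: date -> line numbers of the requests logged on that date.
        private final TreeMap<Date, List<Long>> index = new TreeMap<>();

        void build(String path) throws IOException {
            try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
                String line;
                long lineNumber = 0;
                while ((line = reader.readLine()) != null) {
                    Date date = extractDate(line);   // hypothetical: pull the date out of the line
                    index.computeIfAbsent(date, d -> new ArrayList<>()).add(lineNumber);
                    lineNumber++;
                }
            }
        }

        // Line numbers of all requests between the two dates, inclusive.
        List<Long> linesBetween(Date start, Date end) {
            List<Long> result = new ArrayList<>();
            index.subMap(start, true, end, true).values().forEach(result::addAll);
            return result;
        }

        private Date extractDate(String line) {
            // ... depends on your line format, e.g. parse a leading timestamp ...
            return new Date();
        }
    }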

You could also use the FileChannel class to keep the I/O handle open as you jump from one line to another. It also allows memory mapping.

See http://docs.oracle.com/javase/7/docs/api/java/nio/channels/FileChannel.html

And http://en.wikipedia.org/wiki/Memory-mapped_file
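
For illustration, a minimal memory-mapping sketch (the file name and the offset are placeholders; note that a single mapping is capped at Integer.MAX_VALUE bytes, roughly 2 GB, so a 5 GB file would need several mappings):

    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    class MappedReader {
        public static void main(String[] args) throws IOException {
            try (FileChannel channel = FileChannel.open(Paths.get("requests.log"),
                                                        StandardOpenOption.READ)) {
                // Map (up to) the first 2 GB of the file into memory.
                long size = Math.min(channel.size(), Integer.MAX_VALUE);
                MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, size);

                int offset = 0;                    // placeholder: byte offset of the record you want
                buffer.position(offset);
                byte[] chunk = new byte[256];      // read a small window starting at that offset
                buffer.get(chunk, 0, Math.min(chunk.length, buffer.remaining()));
                System.out.println(new String(chunk, StandardCharsets.UTF_8));
            }
        }
    }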
