
Fast sort by date of huge file ArrayList

I have an ArrayList in Java with a huge number of files (~40,000 files). I need to sort these files ascending/descending by their date. Currently, I use a simple

Collections.sort(fileList, new FileDateComparator());

where FileDateComparator is

public class FileDateComparator implements Comparator<File>
{
    @Override
    public int compare(File o1, File o2)
    {
        if (o1.lastModified() < o2.lastModified())
            return -1;
        if (o1.lastModified() == o2.lastModified())
            return 0;
        return 1;
    }
}

Sorting takes far too long for me, around 20 seconds or more. Is there a more efficient way to do this? I already tried the Apache Commons IO LastModifiedFileComparator as the comparator, but it seems to be implemented the same way, since it takes the same time.

I think you need to cache the modification times to speed this up. You could, for example, try something like this:

// Pairs a file with its modification time, read once up front.
class DatedFile {
  File f;
  long moddate;

  public DatedFile(File f, long moddate) {
    this.f = f;
    this.moddate = moddate;
  }
}


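// Read each file's modification time exactly once.
ArrayList<DatedFile> datedFiles = new ArrayList<DatedFile>();
for (File f: fileList) {
  datedFiles.add(new DatedFile(f, f.lastModified()));
}
// Sort the wrappers, not the original list, so comparisons use the cached times.
Collections.sort(datedFiles, new FileDateComparator());
// Unwrap into a list of files, now in sorted order.
ArrayList<File> sortedFiles = new ArrayList<File>();
for (DatedFile f: datedFiles) {
  sortedFiles.add(f.f);
}

(with FileDateComparator rewritten as a Comparator<DatedFile> that compares the cached moddate fields)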

Sorting is O(n log n), so your list of 40,000 files will need about 600,000 operations (comparisons). If it takes about 20 seconds, that is about 30,000 comparisons per second. So each comparison is taking about 100,000 clock cycles. That cannot be due to CPU-bound processing. The sorting is almost certainly I/O bound rather than CPU bound. Disk seeks are particularly expensive.
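The comparison estimate above can be checked empirically. The following is only a sketch, not part of the original answer: it sorts 40,000 pseudo-random long values (stand-ins for modification times, so no disk access is involved) and counts how often the comparator is called. The class name ComparisonCount and the seed are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Random;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper: counts comparator calls made while sorting n values.
final class ComparisonCount {

    /** Sorts n pseudo-random timestamps and returns the number of comparisons made. */
    static long countComparisons(int n, long seed) {
        Random random = new Random(seed);
        List<Long> timestamps = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            timestamps.add(random.nextLong());
        }

        AtomicLong calls = new AtomicLong();
        Comparator<Long> counting = (a, b) -> {
            calls.incrementAndGet();      // one call == one comparison
            return Long.compare(a, b);
        };
        Collections.sort(timestamps, counting);
        return calls.get();
    }

    public static void main(String[] args) {
        int n = 40_000;
        double estimate = n * (Math.log(n) / Math.log(2)); // n * log2(n)
        long measured = countComparisons(n, 42L);
        System.out.printf("estimate: %.0f, measured: %d%n", estimate, measured);
    }
}
```

With the original comparator, each of those comparisons calls lastModified() two or three times, which is why the number of comparisons matters far more than the number of files.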

You might be able to reduce the time by using multi-threading to reduce the impact of disk seeks. That is, by having several reads queued and waiting for the disk drive to provide their data. To do that, use a (concurrent) map that maps file names to modification times, and populate that map using multiple threads. Then have your sort method use that map rather than use File.lastModified() itself.

Even if you populated that map with only one thread, you would gain a little benefit because your sort method would be using locally cached modification times, rather than querying the O/S every time for the modification times. The benefit of that caching might not be large, because the O/S itself is likely to cache that information.
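The map-based approach described above could look roughly like this. It is a sketch under assumptions, not a tuned implementation: the class name ParallelTimestampSort, the thread count parameter, and the 10-minute timeout are all made up for illustration, and the map is keyed by File rather than by file name. Whether extra threads help at all depends on the disk and file system, so measure before relying on it.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: reads modification times on a thread pool, then sorts from the cache.
final class ParallelTimestampSort {

    /** Returns a new list sorted ascending by modification time; the input list is not changed. */
    static List<File> sortByLastModified(List<File> files, int threads) throws InterruptedException {
        Map<File, Long> modifiedTimes = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (File f : files) {
            // Each task performs one file system query and stores the result.
            pool.execute(() -> modifiedTimes.put(f, f.lastModified()));
        }
        pool.shutdown();
        if (!pool.awaitTermination(10, TimeUnit.MINUTES)) { // timeout value is arbitrary
            pool.shutdownNow();
            throw new IllegalStateException("timed out reading modification times");
        }

        // The sort itself only touches the in-memory map, never the disk.
        List<File> sorted = new ArrayList<>(files);
        sorted.sort(Comparator.comparingLong((File f) -> modifiedTimes.get(f)));
        return sorted;
    }
}
```

Calling sortByLastModified(fileList, 1) gives the single-threaded variant mentioned above: still one lookup per file, just without overlapping the reads.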

Java's Arrays.sort() for object arrays (and Collections.sort()) has used TimSort since Java 7 [ http://svn.python.org/projects/python/trunk/Objects/listsort.txt ], one of the fastest general-purpose sorts out there (much better than quicksort in many situations); you won't be able to sort anything noticeably faster without a heuristic.

"like 20 seconds or more" signifies to me that your problem is probably the famous ApplicationProfilingSkippedByDeveloperException: profile the application and locate the exact bottleneck. My guess is that OS file I/O is one; requesting the file attributes natively in a batch, caching the results, and then processing them all at once seems the only sensible solution here.
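One way to approximate "read the attributes in a batch, cache, then sort" in plain Java is the NIO file tree walker, which passes each file's BasicFileAttributes to the visitor. The sketch below rests on an assumption the question does not state, namely that all files live under a single root directory; the class name DirectoryTimestampSort is made up. Whether this is actually faster than calling File.lastModified() per file depends on the platform and file system, so treat it as something to profile, not as a guaranteed improvement.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: assumes every file of interest is located under one root directory.
final class DirectoryTimestampSort {

    /** Walks root once, caches each regular file's modification time, and returns the files sorted ascending. */
    static List<Path> listSortedByLastModified(Path root) throws IOException {
        Map<Path, Long> modifiedTimes = new HashMap<>();
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                // The attributes arrive with the visit callback; no separate lookup is issued here.
                if (attrs.isRegularFile()) {
                    modifiedTimes.put(file, attrs.lastModifiedTime().toMillis());
                }
                return FileVisitResult.CONTINUE;
            }
        });

        // Sort purely from the cached values.
        List<Path> sorted = new ArrayList<>(modifiedTimes.keySet());
        sorted.sort(Comparator.comparingLong((Path p) -> modifiedTimes.get(p)));
        return sorted;
    }
}
```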

You need to cache the lastModified() values. One way you can do this is in the Comparator itself.

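import java.io.File;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

public class FileDateComparator implements Comparator<File> {
   // Modification times already read, keyed by file.
   private final Map<File, Long> lastModifiedMap = new HashMap<>();

   // Returns the cached time, querying the file system only on first use.
   Long lastModified(File f) {
       Long ts = lastModifiedMap.get(f);
       if (ts == null)
           lastModifiedMap.put(f, ts = f.lastModified());
       return ts;
   }

   @Override
   public int compare(File f1, File f2) {
       return lastModified(f1).compareTo(lastModified(f2));
   }
}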

This will improve performance by only looking up the modified date of each file once.
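The question also asks for descending order, which none of the snippets above show. A compact variant, assuming Java 8 or later, takes a snapshot of the times first and then reverses the comparator on request. The class name SnapshotSort and the descending flag are made up for illustration.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: one lastModified() call per file, ascending or descending result.
final class SnapshotSort {

    /** Returns a new list sorted by modification time; the input list is not changed. */
    static List<File> sortByLastModified(List<File> files, boolean descending) {
        // Snapshot all timestamps before sorting starts.
        Map<File, Long> snapshot = new HashMap<>();
        for (File f : files) {
            snapshot.put(f, f.lastModified());
        }

        Comparator<File> byDate = Comparator.comparingLong((File f) -> snapshot.get(f));
        List<File> sorted = new ArrayList<>(files);
        sorted.sort(descending ? byDate.reversed() : byDate);
        return sorted;
    }
}
```

Because the snapshot is rebuilt on every call, it cannot go stale between sorts. A caching comparator instance that is kept and reused, by contrast, keeps returning the times it saw on first lookup, so create a fresh one per sort if the files may change.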


 