
Merging multiple sorted csv files with complex comparison

I have a list of sorted CSV files that I want to merge into a single sorted output file.

I don't want to do a simple string comparison of the lines; instead the comparison should follow a type map that I have for every column, e.g.:

One of the lines:
1, 15/12/2011, David Raiven, New York

In the type map I have: first column - long, second - date, third - string, ...

So the comparator should compare values accordingly.

How can I do it with the highest efficiency?
PriorityQueue? TreeMap?

I'd prefer not to use third-party libraries or external sorters.
The input files are enormous.

Create an array (or, if you prefer, a Collection) of Readers/InputStreams, one for each CSV file.

Similar to @JustinKSU's idea, create a TreeMap where the key is one line from a CSV file, and pass your custom Comparator implementation that compares by long, Date, etc. The value is the index (probably an Integer; it could be the filename if your Collection is a Map) of the file in your array/Collection.
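The type-aware Comparator the question asks for might be sketched like this (the `ColumnType` enum and class names are hypothetical, and it assumes the `dd/MM/yyyy` date format from the example line):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Comparator;

// Hypothetical per-column type tags; not from the original post.
enum ColumnType { LONG, DATE, STRING }

class CsvLineComparator implements Comparator<String> {
    private static final DateTimeFormatter DATE_FMT =
            DateTimeFormatter.ofPattern("dd/MM/yyyy");
    private final ColumnType[] types;

    CsvLineComparator(ColumnType[] types) {
        this.types = types;
    }

    @Override
    public int compare(String a, String b) {
        // Naive split; real CSV with quoted/embedded commas needs a proper parser.
        String[] fa = a.split(",");
        String[] fb = b.split(",");
        for (int i = 0; i < types.length; i++) {
            String va = fa[i].trim();
            String vb = fb[i].trim();
            int c;
            switch (types[i]) {
                case LONG:
                    c = Long.compare(Long.parseLong(va), Long.parseLong(vb));
                    break;
                case DATE:
                    c = LocalDate.parse(va, DATE_FMT)
                            .compareTo(LocalDate.parse(vb, DATE_FMT));
                    break;
                default:
                    c = va.compareTo(vb);
            }
            if (c != 0) return c;
        }
        return 0;
    }
}
```

Note that typed comparison matters: as longs, `10 > 9`, whereas a plain string comparison would put `"10"` before `"9"`.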

Seed the TreeMap by reading the first line from each file.

Remove the lowest line using TreeMap.pollFirstEntry(), and write the key (the line) to a Writer/OutputStream. Use the value to read one more line from the appropriate file (checking for EOF) and add that into the TreeMap.

Repeat until TreeMap is empty. Close everything.

Edit - Added Source Code below

And note, this only works if the input files are already sorted! (As was specified in the question.)

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Comparator;
import java.util.Map;
import java.util.TreeMap;

public void mergeSort(File[] inFiles, File outFile, Comparator<String> comparator) throws IOException {

      BufferedReader[] readers = new BufferedReader[inFiles.length];
      PrintWriter writer = null;
      try {
         writer = new PrintWriter(outFile);
         TreeMap<String, Integer> treeMap = new TreeMap<String, Integer>(
               comparator);

         // Seed with the first line of each file, skipping empty files
         for (int i = 0; i < inFiles.length; i++) {
            readers[i] = new BufferedReader(new FileReader(inFiles[i]));
            String line = readers[i].readLine();
            if (line != null)
               treeMap.put(line, Integer.valueOf(i));
         }

         // Caveat: a TreeMap keeps one entry per key, so lines the comparator
         // considers equal (e.g. the same line in two files) are collapsed;
         // use a PriorityQueue of (line, fileIndex) pairs to keep duplicates.
         while (!treeMap.isEmpty()) {
            Map.Entry<String, Integer> nextToGo = treeMap.pollFirstEntry();
            int fileIndex = nextToGo.getValue().intValue();
            writer.println(nextToGo.getKey());

            String line = readers[fileIndex].readLine();
            if (line != null)
               treeMap.put(line, Integer.valueOf(fileIndex));
         }
      }
      finally {
         for (BufferedReader reader : readers) {
            if (reader != null)
               reader.close();
         }
         if (writer != null)
            writer.close();
      }
   }

If you want to do it all in memory, I would recommend a TreeSet, passing in your Comparator. That would be the simplest implementation. If you can't store it all in memory, you could open InputStreams to all your files and loop through each until you determine the "lowest" value, then output it to your new file.
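A minimal sketch of the in-memory variant (hypothetical names). Note one trade-off: a TreeSet silently drops lines the Comparator reports as equal, so sorting a List keeps duplicate lines intact:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class InMemorySort {
    // Sorts all lines in memory with the caller's type-aware Comparator.
    // A List + sort preserves duplicate lines, which a TreeSet would drop.
    static List<String> sortLines(List<String> lines, Comparator<String> cmp) {
        List<String> copy = new ArrayList<>(lines);
        copy.sort(cmp);
        return copy;
    }
}
```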

One, maybe a bit unorthodox, option would be to use an on-the-fly database, such as HSQLDB for example. Open a database somewhere in a temp directory where you have enough space, create a table with the needed fields, insert all records from all the CSV files, and finally do a SELECT over all records with an ORDER BY clause that reflects your desired sort order, saving the results where you want. Of course this needs some disk space, but it is a possible solution that I have used in the past for similar problems.
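The database route could be sketched with SQL along these lines (table and column names are hypothetical, matching the four columns in the example line):

```sql
-- Hypothetical schema for the sample row "1, 15/12/2011, David Raiven, New York"
CREATE TABLE records (
    id      BIGINT,
    created DATE,
    name    VARCHAR(100),
    city    VARCHAR(100)
);

-- ...bulk-insert every row from every CSV file, then:
SELECT id, created, name, city
FROM records
ORDER BY id, created, name, city;
```

The ORDER BY gives you the typed comparison (numeric, date, string) for free, since the columns are declared with the right types.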
