
Optimal way to sort a txt file in Java

I've got a CSV file that I'm processing using the opencsv library, so I can read in each line. The particular transformation I need requires me to sort that file before I run it through the main portion of my Java program.

eg

5423, blah2, blah
5323, blah3, blah
5423, blah4, blah
5444, blah5, blah
5423, blah6, blah

should become

5323, blah3, blah
5423, blah2, blah
5423, blah4, blah
5423, blah6, blah
5444, blah5, blah

etc..

The reason I need to do this is that I'm combining all rows with the same id and outputting them to a new file.

Anything wrong with:

  1. Read each line of the csv with the opencsv library

  2. Add them to a 2 dimensional array

  3. Run some sort of sorting on this

  4. Loop through sorted array and output to file.
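The four steps above can be sketched like this (a minimal sketch, using a plain `String.split` in place of opencsv and a `List<String[]>` standing in for the two-dimensional array):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SortThenWrite {
    // Steps 2-4: hold the parsed rows, sort on column 0, re-join for output.
    static List<String> sortRows(List<String> csvLines) {
        List<String[]> rows = new ArrayList<>();
        for (String line : csvLines) {        // step 1 stand-in for opencsv
            rows.add(line.split(",\\s*"));
        }
        rows.sort(Comparator.comparing(row -> row[0]));  // step 3
        List<String> out = new ArrayList<>();
        for (String[] row : rows) {           // step 4: write these lines out
            out.add(String.join(", ", row));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "5423, blah2, blah",
            "5323, blah3, blah",
            "5444, blah5, blah");
        sortRows(lines).forEach(System.out::println);
    }
}
```

Note this compares ids as strings, which matches numeric order here only because the ids are all the same width; for variable-width ids you'd parse them as numbers first.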

Any other ideas on this and what is the best way to sort the data?

Bit rusty on my Java.

UPDATE: To clarify the final output

It would look like:

5323, blah3, blah
5423, blah2!!blah4!!blah6, blah
5444, blah5, blah

This is a very simplified version of what I'm doing. It's actually needed for multi-option fields in a JBase system; this is the requested file format.

There are over 100,000 lines in the original file.

This will be run more than once, and the speed at which it runs is important to me.

To accomplish the most recent request, I would highly suggest using a Multimap from Google Collections. Your code would look like:

CSVReader reader = ...;
CSVWriter writer = ...;

Multimap<String, String> results = TreeMultimap.create();

// read the file
String[] line;
while ((line = reader.readNext()) != null) {
    results.put(line[0], line[1]);
}

// output the file
Map<String, Collection<String>> mapView = results.asMap();
for (Map.Entry<String, Collection<String>> entry : mapView.entrySet()) {
    String[] nextLine = new String[2];
    nextLine[0] = entry.getKey();
    nextLine[1] = formatCollection(entry.getValue());
    writer.writeNext(nextLine);
}

You need to use "blah\\n" as your line ender, and formatCollection is a method you would write to join the values with the "!!" separator. If you care about speed, but not so much about having the entries sorted, you should benchmark against HashMultimap as well.


The most straightforward way is to use the sort command on *nix (e.g. Linux or Mac OS), like

sort -n myfile.csv

Windows has a sort command as well, but it sorts lines alphabetically (i.e. '13,' lines would be placed before '5,' lines).

However, there is nothing wrong with the suggested solution. Instead of constructing the array and sorting it, you can also just use a TreeSet.

EDIT: adding a note about Windows.
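A minimal sketch of the TreeSet alternative, adding the raw lines directly (with one caveat worth knowing about):

```java
import java.util.TreeSet;

public class TreeSetSort {
    public static void main(String[] args) {
        // A TreeSet keeps its elements sorted as they are inserted.
        // Caveat: a Set drops duplicates, so two byte-identical lines
        // would collapse into one; use a List plus a sort if that matters.
        TreeSet<String> lines = new TreeSet<>();
        lines.add("5423, blah2, blah");
        lines.add("5323, blah3, blah");
        lines.add("5444, blah5, blah");
        for (String line : lines) {
            System.out.println(line);
        }
    }
}
```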

Have you tried using Collections.sort() with a Comparator instance?
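A minimal sketch of that suggestion, sorting the parsed rows numerically on the id column:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class ComparatorSort {
    // Sort rows numerically on the id in column 0.
    static void sortById(List<String[]> rows) {
        Collections.sort(rows, new Comparator<String[]>() {
            public int compare(String[] a, String[] b) {
                return Integer.compare(Integer.parseInt(a[0]),
                                       Integer.parseInt(b[0]));
            }
        });
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"5423", "blah2", "blah"});
        rows.add(new String[] {"5323", "blah3", "blah"});
        rows.add(new String[] {"5444", "blah5", "blah"});
        sortById(rows);
        for (String[] row : rows) {
            System.out.println(String.join(", ", row));
        }
    }
}
```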

If you are only interested in sorting on the id, and aren't bothered about the ordering within that id, you could simply combine a MultiValueMap from Commons Collections with a TreeMap:

MultiValueMap m = MultiValueMap.decorate(new TreeMap());

m.put(2, "B");
m.put(3, "Y");
m.put(1, "F");
m.put(1, "E");
m.put(2, "K");
m.put(4, "Q");
m.put(3, "I");
m.put(1, "X");

for(Iterator iter = m.entrySet().iterator(); iter.hasNext(); ) {
    final Map.Entry entry = (Map.Entry)iter.next();
    System.out.println(entry.getKey() + ": " + entry.getValue());
}

Running this gives:

1: [F, E, X]
2: [B, K]
3: [Y, I]
4: [Q]

There is an overloaded decorate method which lets you specify the collection type to use inside the MultiValueMap. You could use this if you need to sort within each id.

You could just use a one-dimensional ArrayList (or another collection) and have Java sort it using the Collections.sort method. Everything else you described sounds pretty standard, though.

You say you need to "sort" the items, but your description sounds as if you need to group them. This could be done many ways; you might want to look into multimaps such as those offered by Google Collections, or you could simply create a

HashMap<Long, List<String>>

and place each line into the relevant list as you read it. My preference in cases like this is two passes through the file: one to add a new ArrayList for each key, and a second to add each string to its list. But it's probably more efficient (just less simple) to use a single pass, checking whether the list is already in the map as you go.
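The single-pass variant of that idea can be sketched like this (a minimal sketch, with a hypothetical `group` helper taking already-parsed rows):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GroupById {
    // Group each row's value under its numeric id in a single pass,
    // creating the list the first time a key is seen.
    static Map<Long, List<String>> group(List<String[]> rows) {
        Map<Long, List<String>> groups = new HashMap<>();
        for (String[] row : rows) {
            Long id = Long.valueOf(row[0]);
            List<String> values = groups.get(id);
            if (values == null) {          // first time we see this id
                values = new ArrayList<>();
                groups.put(id, values);
            }
            values.add(row[1]);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"5423", "blah2"});
        rows.add(new String[] {"5323", "blah3"});
        rows.add(new String[] {"5423", "blah4"});
        System.out.println(group(rows).get(5423L)); // prints [blah2, blah4]
    }
}
```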

It sounds like you don't need to sort the entire thing. I am not sure how many lines you are going to have, but it seems like you could use some sort of hash-based scheme. You can think of your files as buckets in a hashmap: after reading each line, determine which file it belongs to, then further process each file. There are a couple of ways you can do this.

  • If you won't have a lot of "keys", you can just keep all the keys in memory in a hash map of string => string (a map from each key to the filename its lines belong in).

  • If there are too many possible keys to keep in memory, you can bucket the lines into different files to reduce each file's size. Then each file can fit in memory on its own, which lets you dump its lines into a collection and sort them, or apply the first scheme I mentioned.

Does this make sense? I can probably elaborate more if you are confused. I imagine your keys will be made by somehow combining all the columns of your csv line.

This approach will be more scalable if your files get really big. You don't want to depend on having the entire file in memory; sorting takes O(n log n) time, whereas in theory the hashing scheme is just O(n).
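The bucketing step above can be sketched like this (a minimal sketch that buckets in memory; the hypothetical `bucketFor` helper would pick the output file in a real run):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BucketLines {
    // Assign a line to one of numBuckets "files" by hashing its id,
    // so all lines sharing an id land in the same bucket.
    static int bucketFor(String id, int numBuckets) {
        return Math.floorMod(id.hashCode(), numBuckets);
    }

    public static void main(String[] args) {
        String[] lines = {
            "5423, blah2, blah",
            "5323, blah3, blah",
            "5423, blah4, blah"};
        Map<Integer, List<String>> buckets = new HashMap<>();
        for (String line : lines) {
            String id = line.substring(0, line.indexOf(','));
            buckets.computeIfAbsent(bucketFor(id, 4), k -> new ArrayList<>())
                   .add(line);
        }
        System.out.println(buckets);
    }
}
```

Because lines with the same id always hash to the same bucket, each bucket can then be grouped or sorted independently without ever holding the whole file in memory.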

FlatPack is great for reading in files like that and sorting them. It also has options for exporting a data set to a file.
