
Optimal way to sort a txt file in Java

I've got a CSV file that I'm processing using the opencsv library, so I can read in each line. The particular transformation I need to do requires me to sort that file first, before I run through it with the main portion of my Java program.

For example:

5423, blah2, blah
5323, blah3, blah
5423, blah4, blah
5444, blah5, blah
5423, blah6, blah

should become

5323, blah3, blah
5423, blah2, blah
5423, blah4, blah
5423, blah6, blah
5444, blah5, blah

etc.

The reason I need to do this is that I'm combining all rows with the same id and outputting them to a new file.

Anything wrong with:

  1. Read each line of the CSV with the opencsv library

  2. Add them to a 2-dimensional array

  3. Run some sort of sorting on this

  4. Loop through the sorted array and output to file.

Any other ideas on this, and what is the best way to sort the data?

Bit rusty on my Java.

UPDATE: To clarify the final output

It would look like:

5323, blah3, blah
5423, blah2!!blah4!!blah6, blah
5444, blah5, blah

This is a very simplified version of what I'm doing. It is actually needed for multi-option fields in a JBase system. This is the requested file format.

There are over 100,000 lines in the original file.

This will be run more than once, and the speed at which it runs is important to me.

To accomplish the most recent request, I would highly suggest using Multimap from Google Collections. Your code would look like:

CSVReader reader = ...;
CSVWriter writer = ...;

// TreeMultimap keeps the keys (ids) in sorted order
Multimap<String, String> results = TreeMultimap.create();

// read the file
String[] line;
while ((line = reader.readNext()) != null) {
    results.put(line[0], line[1]);
}

// output the file, one row per id
Map<String, Collection<String>> mapView = results.asMap();
for (Map.Entry<String, Collection<String>> entry : mapView.entrySet()) {
    String[] nextLine = new String[2];
    nextLine[0] = entry.getKey();
    // formatCollection is left to you: join the values, e.g. with "!!"
    nextLine[1] = formatCollection(entry.getValue());
    writer.writeNext(nextLine);
}

You need to use "blah\\n" as your line ender. If you care about speed, but not so much about having the entries sorted, you should benchmark against HashMultimap as well.
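For readers without Google Collections on the classpath, here is a minimal sketch of the same grouping idea using only java.util. It is not the code from the answer above: the class name GroupById, the method combine, and this particular formatCollection (which joins values with "!!", taken from the output requested in the question) are assumptions made for illustration. A TreeMap stands in for TreeMultimap, so ids come out in sorted order while the values for each id stay in the order they were read.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class GroupById {
    // Hypothetical stand-in for formatCollection: joins the values with "!!".
    static String formatCollection(List<String> values) {
        return String.join("!!", values);
    }

    // Groups column 1 by the id in column 0.
    // TreeMap iterates its keys in sorted (String) order.
    static List<String[]> combine(List<String[]> rows) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] row : rows) {
            grouped.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        List<String[]> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : grouped.entrySet()) {
            out.add(new String[] { entry.getKey(), formatCollection(entry.getValue()) });
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[] { "5423", "blah2" },
            new String[] { "5323", "blah3" },
            new String[] { "5423", "blah4" },
            new String[] { "5444", "blah5" },
            new String[] { "5423", "blah6" });
        for (String[] row : combine(rows)) {
            System.out.println(row[0] + ", " + row[1]);
        }
    }
}
```

Because the keys are Strings, they are ordered as text; if the ids can have different lengths, a numeric key type such as Long may be a better fit.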

My previous answer:

The most straightforward way is to use the sort command in *nix (e.g. Linux and Mac OS), such as:

sort -n myfile.csv

Windows has a sort command as well, but it sorts the lines alphabetically (i.e. '13,' lines would be placed before '5,' lines).

However, there is nothing wrong with the suggested solution. Instead of constructing the array and sorting it, you can also just use TreeSet.

EDIT: adding a note about Windows.

Have you tried using Collections.sort() with a Comparator instance?
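A rough sketch of that suggestion, assuming each row is a String[] as returned by a CSV reader and that the first column always parses as a number. The class name SortRowsById and the method sortById are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class SortRowsById {
    // Sorts the rows in place by the numeric value of column 0.
    // Assumes column 0 is always a valid integer (possibly padded with spaces).
    static void sortById(List<String[]> rows) {
        Comparator<String[]> byId =
            Comparator.comparingLong(row -> Long.parseLong(row[0].trim()));
        Collections.sort(rows, byId);
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>(Arrays.asList(
            new String[] { "5423", "blah2", "blah" },
            new String[] { "5323", "blah3", "blah" },
            new String[] { "5423", "blah4", "blah" },
            new String[] { "5444", "blah5", "blah" },
            new String[] { "5423", "blah6", "blah" }));
        sortById(rows);
        for (String[] row : rows) {
            System.out.println(String.join(", ", row));
        }
    }
}
```

Collections.sort is documented as stable, so rows that share an id keep their original relative order, which matters if the combined values should appear in file order.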

If you are only interested in sorting on the id, and aren't bothered about the ordering within that id, you could simply combine a MultiValueMap from Commons Collections with a TreeMap:

MultiValueMap m = MultiValueMap.decorate(new TreeMap());

m.put(2, "B");
m.put(3, "Y");
m.put(1, "F");
m.put(1, "E");
m.put(2, "K");
m.put(4, "Q");
m.put(3, "I");
m.put(1, "X");

for(Iterator iter = m.entrySet().iterator(); iter.hasNext(); ) {
    final Map.Entry entry = (Map.Entry)iter.next();
    System.out.println(entry.getKey() + ": " + entry.getValue());
}

Running this gives:

1: [F, E, X]
2: [B, K]
3: [Y, I]
4: [Q]

There is an overloaded decorate method which lets you specify the collection type to use in the MultiValueMap. You could do something with this if you need to sort within the ID.

You could just use a one-dimensional ArrayList (or other collection) and have Java sort it using the Collections sort method. Everything else you described sounds pretty standard, though.

You say you need to "sort" the items, but your description sounds as if you need to group them. This could be done many ways; you might want to look into multimaps such as those offered by Google Collections; or you could simply create a

HashMap<Long, List<String>>

and place each line into the relevant list as you read it. My preference in cases like this is two passes through the file, the first to add a new ArrayList for each key and the second to add each string to its list, but it's probably more efficient (just less simple) to use a single pass, in which you check whether the list is already in the map.
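The single-pass variant could be sketched as follows. The class name GroupLines and the method groupByKey are hypothetical, and the code assumes every line contains at least one comma and starts with a numeric id.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class GroupLines {
    // Single pass: look up the list for the line's id and create it if missing.
    // Assumes each line looks like "<numeric id>, ..." with at least one comma.
    static Map<Long, List<String>> groupByKey(List<String> lines) {
        Map<Long, List<String>> groups = new HashMap<>();
        for (String line : lines) {
            long id = Long.parseLong(line.substring(0, line.indexOf(',')).trim());
            List<String> group = groups.get(id);
            if (group == null) {          // first time this id is seen
                group = new ArrayList<>();
                groups.put(id, group);
            }
            group.add(line);
        }
        return groups;
    }

    public static void main(String[] args) {
        Map<Long, List<String>> groups = groupByKey(Arrays.asList(
            "5423, blah2, blah",
            "5323, blah3, blah",
            "5423, blah4, blah"));
        System.out.println(groups.get(5423L));
    }
}
```

A HashMap does not iterate in key order; if the output has to be sorted by id, swapping in a TreeMap is one option.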

It sounds like you don't need to sort the entire thing. I am not sure how many lines you are going to have, but it seems like you could use some sort of hash-based scheme. You can think of your files as buckets in a hash map: after reading each line, determine which file it belongs to. Then you can further process each file. There are a couple of ways you can do this.

  • If you won't have a lot of "keys", you can actually just keep all the keys in memory as keys in a hash map of string => string (a map from the key to the name of the file the line belongs in).

  • If there are too many possible keys to keep in memory, you can try to bucket the lines into different files to help reduce the size of the files. Then you can keep each file in memory, which would allow you to dump the lines into a collection and sort. Or possibly use the first scheme I mentioned.

Does this make sense? I can probably elaborate more if you are confused. I imagine your keys will be made by somehow combining all the columns of your CSV line.

This approach will be more scalable if your files get really big. You don't want to depend on having the entire file in memory, and sorting takes O(n log n) time, whereas in theory the hashing scheme is just O(n).
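One way the bucketing step might look, as a sketch only. The names BucketByKey, bucketOf and partition are invented here, and the key is assumed to be the first column (the question's id) rather than a combination of columns. Each bucket is a PrintWriter; in real use each one would wrap a file, while the demo below writes to in-memory StringWriters to stay self-contained.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BucketByKey {
    // Maps a key to a bucket index in [0, bucketCount).
    static int bucketOf(String key, int bucketCount) {
        return Math.floorMod(key.hashCode(), bucketCount);
    }

    // Sends each line to the bucket chosen by the hash of its first column,
    // so all lines with the same key end up in the same bucket.
    static void partition(Iterable<String> lines, List<PrintWriter> buckets) {
        for (String line : lines) {
            String key = line.split(",", 2)[0].trim();
            buckets.get(bucketOf(key, buckets.size())).println(line);
        }
        for (PrintWriter bucket : buckets) {
            bucket.flush();
        }
    }

    public static void main(String[] args) {
        List<StringWriter> sinks = new ArrayList<>();
        List<PrintWriter> buckets = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            StringWriter sink = new StringWriter();
            sinks.add(sink);
            buckets.add(new PrintWriter(sink));
        }
        partition(Arrays.asList(
            "5423, blah2, blah",
            "5323, blah3, blah",
            "5423, blah4, blah"), buckets);
        for (int i = 0; i < sinks.size(); i++) {
            System.out.println("bucket " + i + ":");
            System.out.print(sinks.get(i));
        }
    }
}
```

Each bucket can then be loaded and grouped or sorted on its own, since no key is split across buckets.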

FlatPack is great for reading in files like that and sorting them. It also has options for exporting a data set to a file.
