A good procedure for removing duplicates

Question

I have a huge string (1 GB) of space-delimited words, which I convert into an array. The string contains lots of duplicates, and I have to sort it and remove them. I've come up with two procedures and can't decide between them.

Procedure 1

I assume that sorting strings is a costly process, so I want to remove the duplicates with a HashSet first and then sort what is left.

Procedure 2

I sort the array first and then remove duplicates the usual way, by walking the sorted array and comparing each value with the one before it.

From my point of view the first procedure seems better, but I'm not sure whether I'd run into any problems with it. Which one is the better choice?

Answer 1

Assuming memory is not an issue, the most efficient approach, performance-wise, is probably:

String s = someOneGbString();
String[] words = s.split("\\s+");
Set<String> noDupes = new HashSet<>();
Collections.addAll(noDupes, words);

And if you need it sorted:

Set<String> sorted = new TreeSet<> (noDupes);

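Or with Java 8:

// Collect straight into a TreeSet: Collectors.toSet() makes no ordering
// guarantee, so sorting the stream before toSet() would not keep the order.
Set<String> sorted = Arrays.stream(s.split("\\s+"))
                           .collect(Collectors.toCollection(TreeSet::new));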

Answer 2

Case 1: Memory < ~1GB

You can use an external merge sort: http://en.wikipedia.org/wiki/External_sorting#External_merge_sort
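To make the idea concrete, here is a rough sketch of an external merge sort that also drops duplicates. It is only an illustration under some assumptions: the input sits in a file rather than in a String, the class name `ExternalDedupeSort`, the method `sortUnique` and the parameter `maxWordsInMemory` are made up for this example, and the runs are plain temp files with one word per line.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Scanner;
import java.util.TreeSet;

class ExternalDedupeSort {
    /** One open run file plus the word currently at its head. */
    private static final class Run {
        final BufferedReader reader;
        String head;

        Run(BufferedReader reader) throws IOException {
            this.reader = reader;
            this.head = reader.readLine();
        }
    }

    /** Writes the sorted, distinct words of input to output, one per line. */
    static void sortUnique(Path input, Path output, int maxWordsInMemory) throws IOException {
        List<Path> runFiles = new ArrayList<>();
        try {
            // Phase 1: collect a bounded chunk of distinct words, spill it as a sorted run.
            try (Scanner in = new Scanner(input, "UTF-8")) {
                TreeSet<String> chunk = new TreeSet<>();
                while (in.hasNext()) {
                    chunk.add(in.next());
                    if (chunk.size() >= maxWordsInMemory) {
                        runFiles.add(spill(chunk));
                        chunk.clear();
                    }
                }
                if (!chunk.isEmpty()) {
                    runFiles.add(spill(chunk));
                }
            }
            // Phase 2: k-way merge of the runs, skipping words equal to the last one written.
            PriorityQueue<Run> heap = new PriorityQueue<Run>((a, b) -> a.head.compareTo(b.head));
            try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
                for (Path file : runFiles) {
                    Run run = new Run(Files.newBufferedReader(file, StandardCharsets.UTF_8));
                    if (run.head != null) {
                        heap.add(run);
                    } else {
                        run.reader.close();
                    }
                }
                String previous = null;
                while (!heap.isEmpty()) {
                    Run run = heap.poll();
                    if (!run.head.equals(previous)) {
                        out.write(run.head);
                        out.newLine();
                        previous = run.head;
                    }
                    run.head = run.reader.readLine();
                    if (run.head != null) {
                        heap.add(run);      // re-insert with its new head
                    } else {
                        run.reader.close(); // this run is exhausted
                    }
                }
            }
        } finally {
            for (Path file : runFiles) {
                Files.deleteIfExists(file);
            }
        }
    }

    /** Writes one sorted, duplicate-free chunk to a temp file. */
    private static Path spill(TreeSet<String> chunk) throws IOException {
        Path file = Files.createTempFile("run", ".txt");
        Files.write(file, chunk, StandardCharsets.UTF_8);
        return file;
    }
}
```

Because each run is already duplicate-free and sorted, equal words from different runs come out of the heap back to back, so comparing with the last word written is enough to drop them.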

Case 2: Memory > ~1GB

Read the whole String and split it into an array (String[]). Sort the array with an in-place quicksort, then iterate over it and drop every element that equals its predecessor. On JVMs older than Java 7u6, substrings are not copies of the original String but share its backing character array, so this is space efficient. On newer JVMs each substring is a separate copy, so expect the array to need memory on the order of the input again.

Time complexity: O(n log n)
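A minimal sketch of this sort-then-compare approach might look like the following. The class and method names are invented for the example, and it uses `Arrays.sort`, which for object arrays is a merge-sort variant (TimSort) rather than a quicksort, so it stands in for the sort step rather than reproducing it exactly.

```java
import java.util.Arrays;

class SortThenDedupe {
    /** Sorts the words in place, then compacts the unique values to the front. */
    static String[] sortedUnique(String[] words) {
        if (words.length == 0) {
            return words;
        }
        Arrays.sort(words);   // O(n log n) comparisons
        int last = 0;         // index of the last unique word kept so far
        for (int i = 1; i < words.length; i++) {
            // After sorting, duplicates are adjacent, so one comparison per word suffices.
            if (!words[i].equals(words[last])) {
                words[++last] = words[i];
            }
        }
        return Arrays.copyOf(words, last + 1);
    }

    public static void main(String[] args) {
        String[] words = "pear apple pear fig apple".split("\\s+");
        System.out.println(Arrays.toString(sortedUnique(words))); // prints [apple, fig, pear]
    }
}
```

The compaction pass is O(n) and reuses the sorted array, so the only extra allocation is the final trimmed copy.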

Case 3: Memory >> ~1GB

Do as others suggested: use a TreeSet or a HashSet. For a TreeSet, each insertion is O(log n), so the total is O(n log n). However, this will be less efficient than quicksort in terms of both time and space. A HashSet is harder to analyze because its behavior depends on the hash function, but under most circumstances it does fine, with O(n) expected time.
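For the HashSet route, something along these lines should work when memory is plentiful. It is a sketch, not a benchmark: the names `DedupeThenSort` and `sortedUnique` are made up here. With n words of which u are distinct, the hashing pass is O(n) expected and only the u survivors get sorted, which is where the saving comes from when duplicates are common.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class DedupeThenSort {
    /** Removes duplicates with a HashSet first, then sorts only the distinct words. */
    static List<String> sortedUnique(String text) {
        // trim() avoids a leading empty token when the text starts with whitespace
        Set<String> unique = new HashSet<>(Arrays.asList(text.trim().split("\\s+")));
        unique.remove("");                 // blank input splits into a single empty string
        List<String> sorted = new ArrayList<>(unique);
        Collections.sort(sorted);          // O(u log u) on the distinct words only
        return sorted;
    }

    public static void main(String[] args) {
        System.out.println(sortedUnique("b a b c a")); // prints [a, b, c]
    }
}
```

If the sorted result is needed as a Set rather than a List, copying the HashSet into a TreeSet gives the same ordering.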

2 answers

Answer 1: score 2, accepted, 2014-04-20 18:21:46

Answer 2: score 1, 2014-04-20 18:33:26
