Remove duplicates from a large integer array using Java

Do you know of a time-efficient way to remove duplicate values from a very large integer array using Java? The size of the array depends on the logged-in user, but it will always exceed 1500000 unsorted values with some duplicates. Every integer is between 100000 and 9999999.

I tried converting it to a List, but the heap on my server can't hold this much data (my ISP has restricted it). And a regular for loop nested inside a for loop takes over 5 minutes to run.

The size of the array without the duplicates is the value I will store in my database.

Help would be appreciated!

You could perhaps use a bit set? I don't know how efficient Java's BitSet is, but 9999999 possible values would only take 9999999 / 8 ≈ 1250000 bytes, which is just over 1 MB. As you walk the array of values, set the corresponding bit to true. Then you can walk over the bit set and output the corresponding value whenever you find a bit set to true.

About 1 MB will fit in many CPU caches, so this could be quite efficient depending on the bit set implementation.

This also has the side effect of sorting the data.

And this is an O(n) algorithm: it requires a single pass over the input data, the set operations are O(1) (for an array-based set like this), and the output pass scans a bit set of fixed size (about 10 million bits, independent of n) and emits m values, where m is the number of unique values and, by definition, m <= n.
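
A minimal sketch of the bit set idea using java.util.BitSet. The class name, method name and the MAX_VALUE constant are illustrative; the upper bound is taken from the question and is an assumption about the data.

```java
import java.util.Arrays;
import java.util.BitSet;

public class BitSetDedup {
    // Largest value expected in the input (assumption, taken from the question).
    static final int MAX_VALUE = 9_999_999;

    /** Returns the distinct values of input in ascending order. */
    static int[] distinct(int[] input) {
        BitSet seen = new BitSet(MAX_VALUE + 1);
        for (int v : input) {
            seen.set(v); // O(1); setting an already-set bit changes nothing
        }
        int[] out = new int[seen.cardinality()]; // number of set bits = number of unique values
        int k = 0;
        for (int v = seen.nextSetBit(0); v >= 0; v = seen.nextSetBit(v + 1)) {
            out[k++] = v;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] data = {500000, 100000, 9999999, 500000, 100000};
        System.out.println(Arrays.toString(distinct(data))); // [100000, 500000, 9999999]
    }
}
```

If only the number of unique values is needed, as in the question, seen.cardinality() alone is enough and the output array can be skipped.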

I would make a hash set that stores all values contained in the list before I start adding items to the list. Then just check that the hash set doesn't already contain the value you want to add.

Set<Integer> set = new HashSet<Integer>();
Collections.addAll(set, array);

You will just need an Integer[] array instead of an int[].
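
If the data is already an int[], one way to avoid building a separate Integer[] is to add the values one at a time and let autoboxing do the conversion. This is only a sketch with illustrative names. Note that every distinct value still becomes a boxed Integer plus a hash table entry, so the memory cost per value is much higher than 4 bytes, which may matter with the restricted heap described in the question.

```java
import java.util.HashSet;
import java.util.Set;

public class HashSetDedup {
    /** Counts the distinct values in input using a HashSet of boxed integers. */
    static int countDistinct(int[] input) {
        Set<Integer> seen = new HashSet<>();
        for (int v : input) {
            seen.add(v); // autoboxes v; add() returns false if v was already present
        }
        return seen.size();
    }

    public static void main(String[] args) {
        int[] data = {500000, 100000, 9999999, 500000, 100000};
        System.out.println(countDistinct(data)); // 3
    }
}
```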

You can try sorting the array first:

int arr[] = yourarray;
Arrays.sort(arr);
// then iterate arr and remove duplicates
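// requires: import java.util.Arrays;
int[] a = yourarray; // the input array, assumed to be non-empty
Arrays.sort(a);
int j = 0; // index of the last unique element kept so far
for (int i = 1; i < a.length; ++i) {
  if (a[i] != a[j]) {
    ++j;
    a[j] = a[i];
  }
}
// the unique elements are now a[0] through a[j] inclusive, so there are j + 1 of them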
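
The sort-then-compact approach can be wrapped into a self-contained method, sketched here with illustrative names. The clone() call is an assumption made to keep the caller's array untouched; it doubles the memory used, so on a tight heap it may be better to sort the input array in place instead.

```java
import java.util.Arrays;

public class SortDedup {
    /** Returns the distinct values of input in ascending order. */
    static int[] distinctSorted(int[] input) {
        if (input.length == 0) {
            return new int[0];
        }
        int[] a = input.clone(); // work on a copy (assumption; drop this to save memory)
        Arrays.sort(a);          // O(n log n)
        int j = 0;               // index of the last unique element kept
        for (int i = 1; i < a.length; i++) {
            if (a[i] != a[j]) {
                a[++j] = a[i];
            }
        }
        return Arrays.copyOf(a, j + 1); // elements 0..j inclusive are the unique values
    }

    public static void main(String[] args) {
        int[] data = {500000, 100000, 9999999, 500000, 100000};
        System.out.println(Arrays.toString(distinctSorted(data))); // [100000, 500000, 9999999]
    }
}
```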

The truly desperate could write the array to disk, fork off sort <infile.txt | uniq | wc -l and capture the output. This would be needed if memory was still too tight or the domain space of integers got larger. I don't like this (is he even running Unix?), but my point is that there are many ways to accomplish the task.

Another observation is that the minimum value is 100,000. So we could subtract 100,000 from every value before using it as an index, which shrinks the domain from a maximum of 9,999,999 to 9,899,999 and saves some memory. Perhaps 100,000 bits (about 12 KB) is peanuts in the scheme of things, but it's essentially free to do.
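
A sketch of the offset idea applied to a bit set. It only counts the unique values, since that is what the question wants to store. MIN_VALUE and MAX_VALUE come from the question and are assumptions about the data; the class and method names are illustrative.

```java
import java.util.BitSet;

public class OffsetBitSetCount {
    // Value range stated in the question (assumption about the input).
    static final int MIN_VALUE = 100_000;
    static final int MAX_VALUE = 9_999_999;

    /** Counts distinct values, indexing the bit set by (value - MIN_VALUE). */
    static int countDistinct(int[] input) {
        BitSet seen = new BitSet(MAX_VALUE - MIN_VALUE + 1);
        for (int v : input) {
            // A value below MIN_VALUE gives a negative index and
            // BitSet.set throws IndexOutOfBoundsException.
            seen.set(v - MIN_VALUE);
        }
        return seen.cardinality();
    }

    public static void main(String[] args) {
        int[] data = {500000, 100000, 9999999, 500000, 100000};
        System.out.println(countDistinct(data)); // 3
    }
}
```

To recover a value from a set bit, add MIN_VALUE back to the bit index.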

Perhaps you could make a handful of passes over the data? For example, you could make ten passes over the data and, on each pass, apply one of the set suggestions above to a smaller subset of the data (say, the values for which value mod 10 == pass number). Thus:

int uniqueCount = 0;
for (int pass = 0; pass < 10; pass++) {
  Set<Integer> set = new HashSet<>(); // only holds this pass's subset
  for (int entry : data) {            // data is the input int[]
    if (entry % 10 == pass) {
      set.add(entry);
    }
  }
  uniqueCount += set.size();          // or output this pass's unique values
}

This way you trade time for memory (more passes means less memory but more time, and vice versa).
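
The multi-pass idea can also be combined with the bit set suggestion, sketched below under the assumption that all values are non-negative and at most MAX_VALUE. On each pass only the values with one particular remainder are handled, and the bit index is value / passes, so each pass needs a bit set roughly 1/passes of the full size. All names are illustrative.

```java
import java.util.BitSet;

public class MultiPassCount {
    // Largest value expected in the input (assumption, taken from the question).
    static final int MAX_VALUE = 9_999_999;

    /** Counts distinct non-negative values using the given number of passes. */
    static int countDistinct(int[] input, int passes) {
        int total = 0;
        int bitsPerPass = MAX_VALUE / passes + 1;
        for (int p = 0; p < passes; p++) {
            BitSet seen = new BitSet(bitsPerPass); // fresh, smaller set for each pass
            for (int v : input) {
                if (v % passes == p) {
                    // Values sharing remainder p differ in their quotient,
                    // so the quotient identifies them uniquely within this pass.
                    seen.set(v / passes);
                }
            }
            total += seen.cardinality();
        }
        return total;
    }

    public static void main(String[] args) {
        int[] data = {100000, 100001, 100000, 9999999, 100001};
        System.out.println(countDistinct(data, 10)); // 3
    }
}
```

A value can be rebuilt from a set bit as index * passes + p if the unique values themselves are needed rather than just their count.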

Maybe a hash set that works with primitives instead of objects will do the job? There are free implementations (I haven't used them before, but maybe they work):

http://trove4j.sourceforge.net/

http://trove4j.sourceforge.net/javadocs/gnu/trove/TIntHashSet.html

It would then look like:

int[] newArray = new TIntHashSet(yourArray).toArray();
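
To illustrate what a hash set of primitives does without boxing, here is a minimal open-addressing sketch. It is not Trove's implementation, and every name in it is hypothetical. It assumes that 0 never occurs in the data (true for the value range in the question), that the number of distinct values does not exceed the expected count passed to the constructor, and that this count is well below 2^29. It does not resize.

```java
import java.util.Arrays;

public class IntOpenHashSet {
    private final int[] slots; // the value 0 marks an empty slot
    private final int mask;    // capacity - 1, where capacity is a power of two
    private int size;

    IntOpenHashSet(int expectedDistinct) {
        int capacity = 16;
        while (capacity < 2L * expectedDistinct) {
            capacity <<= 1; // keep the table at most half full
        }
        slots = new int[capacity];
        mask = capacity - 1;
    }

    /** Adds value; returns false if it was already present. */
    boolean add(int value) {
        if (value == 0) {
            throw new IllegalArgumentException("0 is reserved as the empty marker");
        }
        int i = mix(value) & mask;
        while (slots[i] != 0) {          // linear probing
            if (slots[i] == value) {
                return false;
            }
            i = (i + 1) & mask;
        }
        if (size == mask) {
            throw new IllegalStateException("table full; this sketch does not resize");
        }
        slots[i] = value;
        size++;
        return true;
    }

    int size() {
        return size;
    }

    /** Returns the stored values in no particular order. */
    int[] toArray() {
        int[] out = new int[size];
        int k = 0;
        for (int s : slots) {
            if (s != 0) {
                out[k++] = s;
            }
        }
        return out;
    }

    // Simple multiplicative scramble to spread nearby values across the table.
    private static int mix(int v) {
        int h = v * 0x9E3779B9;
        return h ^ (h >>> 16);
    }

    public static void main(String[] args) {
        int[] data = {500000, 100000, 500000, 9999999, 100000};
        IntOpenHashSet set = new IntOpenHashSet(data.length);
        for (int v : data) {
            set.add(v);
        }
        int[] unique = set.toArray();
        Arrays.sort(unique);
        System.out.println(Arrays.toString(unique)); // [100000, 500000, 9999999]
    }
}
```

With the table kept at most half full, the backing array uses about 8 to 16 bytes per expected value, which is less than a HashSet of boxed Integer objects but more than the bit set for this value range.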

If you are sure that the integers have reasonably small values (e.g. always non-negative and less than 1000 or 10000), you can try a trick like this:

    final int MAX = 100; 
    int[] arrayWithRepeats = {99, 0, 10, 99, 0, 11, 99};

    // count how many times each value occurs
    int [] arrayOfValues = new int[MAX+1];
    int countOfUniqueIntegers = 0;
    for(int i : arrayWithRepeats) {
        if(arrayOfValues[i] == 0) {
            countOfUniqueIntegers++;
        }
        arrayOfValues[i]++;
    }

    // you can use arrayOfValues directly (smaller) or convert it
    // to an array of unique values (more usable)

    int[] arrayOfUniqueValues = new int[countOfUniqueIntegers];
    int index = 0;
    for(int i = 0; i<arrayOfValues.length; i++) {
        if(arrayOfValues[i] != 0) {
            arrayOfUniqueValues[index] = i;
            index++;
        }
    }

    // arrayOfUniqueValues is now also sorted
    System.out.println( Arrays.toString(arrayOfUniqueValues) );

Output: [0, 10, 11, 99]
