简体   繁体   中英

Remove duplicates from a large integer array using Java

Do you know of any time efficient way to remove duplicated values from a very big integer array using Java? The size of the array depends on the logged in user, but will always exceed 1500000 unsorted values with some duplicates. Every integer contains a number between 100000 and 9999999.

I tried converting it to a List, but the heap on my server doesn't allow this amount of data(my ISP has restricted it). And a regular for loop within a for loop takes over 5 minutes to calculate.

The size of the array without the duplicates is the one I will store in my database.

Help would be appreciated!

You could perhaps use a bit set? I don't know how efficient Java's BitSet is. But 9999999 possible values would only take 9999999 / 8 = 1250000 bytes = just over 1Mb. As you walk the array of values, set the corresponding bit to true. Then you can walk over the bit set and output the corresponding value whenever you find a bit set to true.

1Mb will fit in a CPU cache, so this could be quite efficient depending on the bit set implementation.

This also has the side-effect of sorting the data too.

And... this is an O(n) algorithm since it requires a single pass over the input data, the set operations are O(1) (for an array-based set like this), and the output pass is also O(m) where m is the number of unique values and, by definition, must be <= n.

I would make a hashset where I store all values contained in the list, before i start adding items to the list. Then just check so that the hashset doesn't contain the value you want to add.

Set<Integer> set = new HashSet<Integer>();
Collections.addAll(set, array);

you will just need an array of Integer[] instead of int[] .

You can try sorting the array first:

int arr[] = yourarray;
Arrays.sort(arr);
// then iterate arr and remove duplicates
int[] a;
Arrays.sort(a);
int j = 0;
for (int i = 1; i < a.length; ++i) {
  if (a[i] != a[j]) {
    ++j;
    a[j] = a[i];
  }
}
// now store the elements from 0 to j (inclusive - i think)

The truly desperate could write the array to disk and fork off sort | uniq | wc -l <infile.txt sort | uniq | wc -l <infile.txt sort | uniq | wc -l <infile.txt and capture the output. This would be needed if memory was still too tight or the domain space of integers got larger. I don't like this (is he even running unix!) but my point is that there are many ways to accomplish the task.

Another observation is that the minimum value is 100,000. So we could subtract 100,000 from the maximum value of 9,999,999, reducing the domain space and thus saving some memory. Perhaps 100k/8 bits is peanuts in the scheme of things, but it's essentially free to do it.

Perhaps you could make a handful of passes over the data? For example, if you made ten passes over the data and applied one of the set suggestions above to a smaller subset of the data (say, when value mod pass# == 0). Thus:

for (int i = 0 to 9) {
  set = new Set()
  for (each entry in the data set) {
    if (entry % i == 0) {
      set.add(entry)
    }
  }
  output set
}

This way you will trade off time for memory (increase the number of passes for less memory/more time and vice-versa).

Maybe a hash set that works with primitives instead of objects will do the job? There are free implementations (havn't used them before but maybe it works):

http://trove4j.sourceforge.net/

http://trove4j.sourceforge.net/javadocs/gnu/trove/TIntHashSet.html

Would then look like:

int[] newArray = new TIntHashSet(yourArray).toArray();

If you are sure, that integers have resonable small values (eg always more than zero and less than 1000 or 10000), you can try a trick like this:

    final int MAX = 100; 
    int[] arrayWithRepeats = {99, 0, 10, 99, 0, 11, 99};

    //we are counting here integers with the same value
    int [] arrayOfValues = new int[MAX+1];
    int countOfUniqueIntegers = 0;
    for(int i : arrayWithRepeats) {
        if(arrayOfValues[i] == 0) {
            countOfUniqueIntegers++;
        }
        arrayOfValues[i]++;
    }

    // you can use arrayOfValues (smaller) or convert it
    // to table of unique values (more usable)

    int[] arrayOfUniqueValues = new int[countOfUniqueIntegers];
    int index = 0;
    for(int i = 0; i<arrayOfValues.length; i++) {
        if(arrayOfValues[i] != 0) {
            arrayOfUniqueValues[index] = i;
            index++;
        }
    }

    //and now arrayOfUniqueValues is even sorted
    System.out.println( Arrays.toString(arrayOfUniqueValues) );

Output: [0, 10, 11, 99]

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM