
How can I identify duplicates in a large array?

For example, I have an array with integers from 1 to 1,000,000. One of those numbers is duplicated inside the array. I would imagine that sorting the array numerically and then looping through to check for when a value repeats would be the most efficient way of doing this, but it seems long-winded and I'm not 100% sure. Anyone?

There is in fact a better way. You can loop over the array once and find duplicates by using a cache of seen values. If you're working with plain numbers this is easy enough, since numbers are valid object keys:

// filter()'s second argument becomes `this` inside the (non-arrow) callback,
// so the empty object acts as the seen-values cache.
[1, 2, 3, 3, 4, 5].filter(
  function (value) {
    if (this[value]) return true;
    this[value] = true;
  },
  {}
); // [3]

If you want to know the indexes at which the duplicates occurred, you'll need to treat the cache a little differently:

var cache = {}; // the semicolon matters: without it, the next line is parsed as {}[...]
[1, 2, 2, 3, 4, 5, 5].filter(
  function (value, index) {
    if (this[value]) {
      this[value].push(index);
      return true;
    }
    this[value] = [index];
  },
  cache
);
Object.entries(cache); // [["1", [0]], ["2", [1, 2]], ["3", [3]], ["4", [4]], ["5", [5, 6]]]

Of course a plain object won't always work as the cache (its keys are coerced to strings), so you may need a different data type, but either way you only have to loop over the array once.
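For instance, a Map can serve as that cache for any value type, since Map keys aren't coerced to strings the way object keys are (the findDuplicates helper here is just an illustrative sketch, not part of the original answer):

```javascript
// Sketch: a Map as the seen-values cache. Unlike a plain object,
// Map keys keep their type, so 1 and "1" are treated as distinct.
function findDuplicates(arr) {
  const seen = new Map();
  return arr.filter(value => {
    if (seen.has(value)) return true;
    seen.set(value, true);
    return false;
  });
}

console.log(findDuplicates([1, "1", 2, 2, "1"])); // [2, "1"]
```

With a plain-object cache, 1 and "1" would collide on the same key and be reported as duplicates of each other.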


This was an answer to a misinterpretation: how to make sure an array only contains unique entries. When I find a proper place to put this, I'll move it.

There's no fundamentally better way, but in JavaScript, for simple data types, there is a more compact one thanks to Set and Array.from():

Array.from(new Set([1,1,2,2,3,3,4,4,5])) // [1,2,3,4,5]

This does not work for arrays of structurally similar objects, though, as that requires deeper inspection. But for simple values it works wonders. For simple objects you COULD JSON.stringify() them all before sending them through Set.
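A sketch of that stringify-first idea (variable names are illustrative, not from the original answer):

```javascript
// Deduplicate plain objects by serializing each one, so Set compares
// JSON text rather than object references.
const points = [{ x: 1, y: 2 }, { x: 1, y: 2 }, { x: 3, y: 4 }];

const unique = Array.from(
  new Set(points.map(p => JSON.stringify(p))),
  json => JSON.parse(json)
);

console.log(unique); // [{ x: 1, y: 2 }, { x: 3, y: 4 }]
```

Note that this only works when equal objects serialize to identical strings; { x: 1, y: 2 } and { y: 2, x: 1 } stringify differently and would not be deduplicated.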

This works because Set keys entries by the values themselves. Identical values (pointer references or basic data types) are assigned the same slot in the Set, so when Array.from() iterates over the Set, it only iterates over the unique values. This might not be fast enough for large datasets (millions of records), but it's maintainable for smaller ones and, if nothing else, easily simplifies how the code looks.
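A small illustration of that reference point (not from the original answer): two distinct objects with identical contents occupy separate slots in a Set, while repeating the same reference or primitive does not:

```javascript
const a = { id: 1 };
const b = { id: 1 }; // structurally equal to a, but a different reference

console.log(new Set([a, a, b]).size);   // 2 -- the repeated reference a is collapsed, b is kept
console.log(new Set([1, 1, "1"]).size); // 2 -- SameValueZero keeps 1 and "1" distinct
```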

You'll always have to loop through the array one way or another. Something like the following (not sure how it'll work out for large arrays):

[edit 2018/10/26] Nowadays, using Set is the way to get unique Array entries. It looks pretty efficient. See the adjusted snippet (note: creating an Array of ten million elements takes a while, so be patient running it).

let arrayUniquify = (arr = []) => [...new Set(arr)];

const largeArray = Array.from(new Array(10000000), () => Math.floor(Math.random() * 10));
console.log(`largeArray.length: ${largeArray.length} (ten million elements)`);

let start = performance.now();
const uniquified = arrayUniquify(largeArray);
console.log(`uniquify lasted ${(performance.now() - start).toFixed(2)} ms`);
console.log(`largeArray.length now: ${uniquified.length}`);
console.log(`uniquified array: [${uniquified}]`);

// remove duplicates
let objTmp = {};
const uniqueValues = ["foo", 2, 4, 5, 5, 7, 8, 9, 0, "foo", 8, 9, 5]
  .filter(v => v in objTmp ? false : (objTmp[v] = true) && true, objTmp);
console.log(uniqueValues);

// display duplicate values
objTmp = {};
const doubleValues = ["foo", 2, 4, 5, 5, 7, 8, 9, 0, "foo", 8, 9, 5]
  .filter(v => v in objTmp ? true : (objTmp[v] = true) && false, objTmp);
console.log(doubleValues);
