
How can I identify duplicates in a large array?

For example, I have an array of the integers from 1 to 1,000,000, and one of those numbers is duplicated somewhere in the array. I imagine that sorting the array numerically and then looping through it to check where a value repeats would be the most efficient way of doing this, but it seems long-winded and I'm not 100% sure. Anyone?
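
For comparison, the sort-then-scan approach described above could look roughly like the sketch below. The helper name `findDuplicateBySorting` is made up for illustration, and the sketch assumes the array holds plain numbers.

```javascript
// Hypothetical helper: returns the first repeated value, or undefined if there is none.
function findDuplicateBySorting(values) {
  // Copy before sorting so the caller's array is left untouched.
  const sorted = [...values].sort((a, b) => a - b);
  // After sorting, equal values sit next to each other.
  for (let i = 1; i < sorted.length; i++) {
    if (sorted[i] === sorted[i - 1]) return sorted[i];
  }
  return undefined;
}

console.log(findDuplicateBySorting([5, 3, 1, 4, 3, 2])); // 3
```

The sort costs O(n log n) comparisons, which is the part a single-pass cache avoids.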

There is in fact a better way. You can loop over the array once and find duplicates by keeping a cache of the values you have already seen. With basic numbers this is easy, since they are valid Object keys:

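[1,2,3,3,4,5].filter(
  function(value){
    // `this` is the cache object passed as filter's second argument
    if(this[value]) return true   // seen before: keep it as a duplicate
    this[value] = true            // first sighting: record it
    return false
  },
  {}
)
// [3]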

If you want to know the indices at which each value occurred, you'll need to treat the cache a little differently:

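// The semicolon matters: without it, {} and the [..] on the next line
// parse as a single property access and the code throws a TypeError.
var cache = {};
[1,2,2,3,4,5,5].filter(
  function(value, index){
    if(this[value]){
      this[value].push(index)   // repeat: record this index too
      return true
    }
    this[value] = [index]       // first sighting: start the index list
  },
  cache
)
Object.entries(cache)
// [["1",[0]], ["2",[1,2]], ["3",[3]], ["4",[4]], ["5",[5,6]]]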

Of course, a basic object might not always work as the cache (its keys are coerced to strings), so you may need a different data type for it, but either way you only have to loop over the array once.
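
One possible "different data type" is a Map, which compares keys by value or reference instead of turning them into strings. The following is only a sketch; `findDuplicateIndices` is a hypothetical name, not something from the answer above.

```javascript
// Hypothetical helper: maps each duplicated value to the indices where it occurs.
function findDuplicateIndices(values) {
  const seen = new Map(); // value -> array of indices
  values.forEach((value, index) => {
    if (seen.has(value)) seen.get(value).push(index);
    else seen.set(value, [index]);
  });
  // Keep only the values that occurred more than once.
  return new Map([...seen].filter(([, indices]) => indices.length > 1));
}

const shared = { id: 1 };
const result = findDuplicateIndices([1, '1', shared, 1, shared]);
console.log(result.get(1));      // [0, 3]
console.log(result.has('1'));    // false
console.log(result.get(shared)); // [2, 4]
```

With a plain object as the cache, `1` and `'1'` would land on the same key and every object would collapse to `"[object Object]"`; the Map keeps them apart while still needing only one pass.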


This was an answer to a misreading of the question, taking it to be about making sure an array contains only unique references. When I find a proper place for it I will move it there.

There's not a better way, but in JavaScript, for simple data types, there is a more compact way thanks to Set and Array.from():

Array.from(new Set([1,1,2,2,3,3,4,4,5])) // [1,2,3,4,5]

This does not work for arrays of similar objects, though, as comparing those requires inspecting their contents. But for simple things it works wonders. Of course, with simple objects you COULD JSON.stringify() them all before sending them through Set.
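
A minimal sketch of that JSON.stringify() idea follows, assuming the objects are plain and JSON-serializable. The helper name `uniqueByJson` is invented for illustration.

```javascript
// Hypothetical helper: de-duplicates plain, JSON-serializable objects by value.
function uniqueByJson(items) {
  // Identical JSON text collapses to a single Set entry.
  const serialized = new Set(items.map((item) => JSON.stringify(item)));
  // Parse back, so the results are copies rather than the original objects.
  return Array.from(serialized, (text) => JSON.parse(text));
}

const points = [{ x: 1, y: 2 }, { x: 1, y: 2 }, { y: 2, x: 1 }];
console.log(uniqueByJson(points).length); // 2
```

Note that key order matters here: `{ x: 1, y: 2 }` and `{ y: 2, x: 1 }` serialize differently, so they count as distinct.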

This works because Set keys its entries by the data itself. Equal values (the same pointer reference or the same basic value) are assigned the same place in the Set, so when Array.from() iterates over the Set it iterates over the unique values only. This might not be fast enough for large datasets (millions of records), but it is maintainable for smaller datasets and, if nothing else, simplifies how the code looks.
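
To make the "same value" rule concrete, here is a small illustration of how Set treats primitives versus object references. The variable names are arbitrary.

```javascript
const a = { id: 1 };
const b = { id: 1 }; // same contents as a, but a different reference

const primitiveCount = new Set([1, 1, '1']).size; // the number 1 and the string '1' differ
const sameRefCount = new Set([a, a]).size;        // one reference, stored once
const lookalikeCount = new Set([a, b]).size;      // two references, both kept
const nanCount = new Set([NaN, NaN]).size;        // Set treats NaN as equal to itself

console.log(primitiveCount, sameRefCount, lookalikeCount, nanCount); // 2 1 2 1
```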

I suppose you always have to loop through the array one way or another. Something like this (not sure how it'll work out for large arrays):

[edit 2018/10/26] Nowadays, using Set is a way to get unique Array entries. It looks pretty efficient. See the adjusted snippet (note: creating an Array of ten million elements takes a while, so be patient when running it).

let arrayUniquify = (arr = []) => [...new Set(arr)];
const largeArray = Array.from(new Array(10000000), () => Math.floor(Math.random()*10));
console.log(`largeArray.length: ${largeArray.length} (ten million elements)`);
let start = performance.now();
const uniquified = arrayUniquify(largeArray);
console.log(`uniquify lasted ${(performance.now() - start).toFixed(2)} ms`);
console.log(`largeArray.length now: ${uniquified.length}`);
console.log(`uniquified array: [${uniquified}]`);

let objTmp = {};
// remove duplicates
const uniqueValues = ["foo",2,4,5,5,7,8,9,0,"foo",8,9,5]
  .filter( v => v in objTmp ? false : (objTmp[v] = true) && true, objTmp);
console.log(uniqueValues);

// display duplicate values
objTmp = {};
const doubleValues = ["foo",2,4,5,5,7,8,9,0,"foo",8,9,5]
  .filter( v => v in objTmp ? true : (objTmp[v] = true) && false, objTmp);
console.log(doubleValues);
