
Performance when checking for duplicates

I've been working on a project where I need to iterate through a collection of data and remove entries where the "primary key" is duplicated. I have tried using a

List<int>

and

Dictionary<int, bool>

With the dictionary I found slightly better performance, even though I never need the Boolean tagged with each entry. My expectation is that this is because a List allows for indexed access and a Dictionary does not. What I was wondering is: is there a better solution to this problem? I do not need to access the entries again; I only need to track which "primary keys" I have seen and make sure I only perform additional work on entries that have a new primary key. I'm using C# and .NET 2.0, and I have no control over fixing the input data to remove the duplicates at the source (unfortunately!). So you can have a feel for the scaling: overall I'm checking for duplicates about 1,000,000 times in the application, but in subsets of no more than about 64,000 entries that need to be unique.
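For reference, the Dictionary-as-a-set approach described above looks roughly like this on .NET 2.0 (the names and sample data here are illustrative, not from the question):

```csharp
using System;
using System.Collections.Generic;

class Program
{
    static void Main()
    {
        int[] keys = { 5, 7, 5, 9, 7 };

        // the bool value is never used; the dictionary serves only as a key set
        Dictionary<int, bool> seen = new Dictionary<int, bool>();

        foreach (int key in keys)
        {
            if (!seen.ContainsKey(key))
            {
                seen.Add(key, true);
                // additional work happens only for new primary keys
                Console.WriteLine("processing " + key);
            }
        }
    }
}
```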

They added the HashSet class in .NET 3.5, but I guess it will be on par with the Dictionary. If you have fewer than, say, 100 elements, a List will probably perform better.
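On .NET 3.5 the whole check collapses into HashSet<T>.Add, which returns false for a key it has already seen. A minimal sketch (sample data is made up):

```csharp
using System;
using System.Collections.Generic;

class Program
{
    static void Main()
    {
        int[] keys = { 5, 7, 5, 9, 7 };
        HashSet<int> seen = new HashSet<int>();

        foreach (int key in keys)
        {
            // Add returns true only for a key we haven't seen before
            if (seen.Add(key))
            {
                Console.WriteLine("processing " + key);
            }
        }
    }
}
```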

Edit: Never mind my comment. I thought you were talking about C++. I have no idea if my post is relevant in the C# world.

A hash table could be a tad faster. Binary trees (that's what's used in the dictionary) tend to be relatively slow because of the way the memory gets accessed. This is especially true if your tree becomes very large.

However, before you change your data structure, have you tried using a custom pool allocator for your dictionary? I bet the time is not spent traversing the tree itself but in the millions of allocations and deallocations the dictionary will do for you.

You may see a factor-10 speed boost just from plugging a simple pool allocator into the dictionary template. AFAIK, Boost has a component that can be used directly.

Another option: if you know only 64,000 distinct integers exist, you can write those to a file and create a perfect hash function for them. That way you can just use the hash function to map your integers into the 0 to 64,000 range and index a bit array.

Probably the fastest way, but less flexible. You have to redo your perfect hash function (this can be done automatically) each time your set of integers changes.

I don't really get what you are asking.

First, it's just the opposite of what you say: the Dictionary has indexed access (it's a hash table), while the List doesn't.

If you already have the data in a dictionary, then all keys are unique; there can be no duplicates.

I suspect you have the data stored in another data type and you're copying it into the dictionary. If that's the case, inserting the data will work with two dictionaries:

Dictionary<int, bool> MyDataDict = new Dictionary<int, bool>();
Dictionary<int, bool> MyDuplicatesDict = new Dictionary<int, bool>();

foreach (int key in keys)
{
  if (!MyDataDict.ContainsKey(key))
  {
    // first time we see this key
    MyDataDict.Add(key, true);
  }
  else if (!MyDuplicatesDict.ContainsKey(key))
  {
    // already seen: record it as a duplicate
    MyDuplicatesDict.Add(key, true);
  }
}

If you are checking for the uniqueness of integers, and the range of integers is constrained enough, then you could just use an array.

For better packing you could implement a bitmap data structure (basically an array, but each int in the array represents 32 ints in the key space, using one bit per key). That way, if your maximum number is 1,000,000, you only need ~122 KB of memory for the data structure (31,250 32-bit ints).

Performance of a bitmap would be O(1) (per check), which is hard to beat.
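A minimal sketch of such a bitmap (the BitmapSet name and API are mine; it assumes keys are non-negative and below a known maximum):

```csharp
using System;

class BitmapSet
{
    // one bit per key; for keys in [0, 1,000,000) this is ~122 KB
    private readonly uint[] bits;

    public BitmapSet(int maxKey)
    {
        bits = new uint[(maxKey + 31) / 32];
    }

    // returns true if the key was newly added, false if it was already present
    public bool Add(int key)
    {
        int index = key >> 5;           // which uint holds this key's bit
        uint mask = 1u << (key & 31);   // which bit within that uint
        if ((bits[index] & mask) != 0)
            return false;
        bits[index] |= mask;
        return true;
    }
}
```

Clearing the structure between 64,000-entry subsets is just an Array.Clear over 31,250 ints, which is far cheaper than reallocating a dictionary.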

There was a question a while back on removing duplicates from an array. For the purposes of that question performance wasn't much of a consideration, but you might want to take a look at the answers, as they might give you some ideas. Also, I might be off base here, but if you are trying to remove duplicates from the array, then a LINQ method like Enumerable.Distinct might give you better performance than something you write yourself. As it turns out there is a way to get LINQ working on .NET 2.0, so this might be a route worth investigating.
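A small sketch of the Distinct approach (this needs .NET 3.5+, or a backport such as LINQBridge on 2.0; the sample data is made up):

```csharp
using System;
using System.Linq;

class Program
{
    static void Main()
    {
        int[] keys = { 3, 1, 3, 2, 1 };

        // Distinct uses an internal hash set and keeps
        // the first occurrence of each key
        foreach (int key in keys.Distinct())
        {
            Console.WriteLine(key); // prints 3, 1, 2
        }
    }
}
```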

If you're going to use a List, use BinarySearch:

// initialize to a size if you know your set size
List<int> FoundKeys = new List<int>( 64000 );
Dictionary<int,int> FoundDuplicates = new Dictionary<int,int>();

foreach ( int Key in MyKeys )
{
   // this is an O(log N) operation
   int index = FoundKeys.BinarySearch( Key );
   if ( index < 0 ) 
   {
       // if the Key is not in our list, index is the bitwise
       // complement of the position it should occupy,
       // so inserting at ~index maintains sorted-ness!
       FoundKeys.Insert( ~index, Key );
   }
   else 
   {
       if ( FoundDuplicates.ContainsKey( Key ) )
       {
           FoundDuplicates[Key]++;
       }
       else
       {
           FoundDuplicates.Add( Key, 1 );
       }
   } 
} 

You can also use this for any type for which you can define an IComparer, via the overload BinarySearch(T item, IComparer<T> comparer).
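For example, with a hypothetical comparer that orders strings by length (the list must already be sorted by the same comparer, or the result is undefined):

```csharp
using System;
using System.Collections.Generic;

// illustrative comparer: orders strings by length, then ordinally
class LengthComparer : IComparer<string>
{
    public int Compare(string x, string y)
    {
        int byLength = x.Length.CompareTo(y.Length);
        return byLength != 0 ? byLength : string.CompareOrdinal(x, y);
    }
}

class Program
{
    static void Main()
    {
        LengthComparer comparer = new LengthComparer();
        // already sorted according to LengthComparer
        List<string> keys = new List<string> { "a", "bb", "ccc" };

        int found = keys.BinarySearch("bb", comparer);
        Console.WriteLine(found);    // 1

        int missing = keys.BinarySearch("dd", comparer);
        Console.WriteLine(~missing); // 2: the insertion point that keeps the list sorted
    }
}
```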
