
Efficient algorithm to find additions and removals from 2 collections

Hi, I would like to implement an efficient algorithm to handle the following case:

Let's assume we have 2 lists with the following elements:

Source: [a,b,c,d,e]
New: [d,e,f,g]

Now I have to update Source with the new information. The algorithm should be able to find that 'f' and 'g' are new entries, that 'a', 'b' and 'c' have been removed, and that 'd' and 'e' have not been modified.

The operations involved are set-intersect operations between Source and New, and vice versa. I am looking for an efficient algorithm to implement in C# for arbitrary non-sorted enumerations.

Thanks in advance,

var added = New.Except(Source);
var removed = Source.Except(New);
var notModified = Source.Intersect(New);

If you want an approach where you "show your workings", I'd suggest putting each of them into a HashSet, as that allows a fast Contains check compared with other enumerations.
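To make the HashSet suggestion concrete, here is a minimal sketch, assuming both inputs arrive as IEnumerable<T>. The class name HashProbeDiff and the method Compare are made up for illustration and are not part of any library.

```csharp
using System.Collections.Generic;

public static class HashProbeDiff
{
    // Hypothetical helper: classifies items by probing hash sets built
    // from each input. Duplicates within an input are collapsed, just as
    // they are with Except/Intersect.
    public static (List<T> Added, List<T> Removed, List<T> NotModified) Compare<T>(
        IEnumerable<T> source, IEnumerable<T> updated)
    {
        var sourceSet = new HashSet<T>(source);
        var newSet = new HashSet<T>(updated);

        var added = new List<T>();
        var removed = new List<T>();
        var notModified = new List<T>();

        // Every item of New is either shared with Source or newly added.
        foreach (T item in newSet)
        {
            if (sourceSet.Contains(item))
                notModified.Add(item);
            else
                added.Add(item);
        }

        // Every item of Source that New lacks has been removed.
        foreach (T item in sourceSet)
        {
            if (!newSet.Contains(item))
                removed.Add(item);
        }

        return (added, removed, notModified);
    }
}
```

Calling HashProbeDiff.Compare(Source, New) on the lists from the question should classify 'f' and 'g' as added, 'a', 'b' and 'c' as removed, and 'd' and 'e' as not modified. The order of items within each result list is not guaranteed, since it follows HashSet enumeration order.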

Edit:

Okay, if we're going for total speed at the cost of efficiency of expression, then with the following assumptions:

  1. We have a reasonably hashable type of item (if not, but the items can be put in a total order, then a SortedList might beat a hash set).
  2. We cannot predict whether Source or New will be larger (in the example, there's a slight advantage to doing this the other way around from how I have it, but I'm assuming that is just chance in the data and that we have to expect each with equal likelihood).

Then I would suggest:

HashSet<T> removed = Source as HashSet<T> ?? new HashSet<T>(Source);
LinkedList<T> added = new LinkedList<T>();
LinkedList<T> notModified = new LinkedList<T>();
foreach(T item in New)
    if(removed.Remove(item))
        notModified.AddLast(item);
    else
        added.AddLast(item);

In setting up removed, I test whether it's already a HashSet to avoid wastefully building a new one (I assume the input is typed as IEnumerable<T>). Of course, this is a destructive action, so we may wish to avoid it anyway.
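If the destructive behaviour is a concern, one option is to always copy Source. The following is only a sketch of that variation: the wrapper class NonDestructiveDiff and the method Compare are hypothetical names, and it assumes New contains no duplicates (a repeated item would be reported as not modified once and as added afterwards).

```csharp
using System.Collections.Generic;

public static class NonDestructiveDiff
{
    // Hypothetical wrapper around the loop above. It always copies Source,
    // so a HashSet<T> passed in by the caller is never mutated.
    public static (HashSet<T> Removed, LinkedList<T> Added, LinkedList<T> NotModified)
        Compare<T>(IEnumerable<T> source, IEnumerable<T> updated)
    {
        // Start by assuming everything in Source was removed.
        var removed = new HashSet<T>(source);
        var added = new LinkedList<T>();
        var notModified = new LinkedList<T>();

        foreach (T item in updated)
        {
            // Remove returns true only if the item was present in the set.
            if (removed.Remove(item))
                notModified.AddLast(item);
            else
                added.AddLast(item);
        }

        return (removed, added, notModified);
    }
}
```

The trade-off is one extra set construction when Source is already a HashSet<T>, in exchange for leaving the caller's collection untouched.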

Note also that I modify the hashset while enumerating through it. This is allowed by HashSet, but it is outside the guarantees given by the enumerators, so it is implementation-dependent. Still, with the current framework implementation, it's more efficient to do so than to test and add to a different removed collection.

I went for linked lists for the two other collections, as they tend to come out well in terms of insertion cost (not just O(1), but a fast O(1) compared to using another set).

Now, if you want to go further still, there are probably micro-optimisations available in the implementation of the hash set if you roll your own.

I have not tested this for performance, but my gut feeling is that you should first sort the two lists. Then you can step through the lists, marking each element as removed, added or unchanged as you progress.

1- Sort the Old and New lists
2- Set up a pointer for each list; let's call them p1 and p2 (zero-based)
3- Step the pointers using the following algorithm
  a) If Old[p1] = New[p2] the items are unchanged, increment p1 and p2
  b) If Old[p1] < New[p2] then Old[p1] has been removed, increment p1
  c) If Old[p1] > New[p2] then New[p2] is a new element, increment p2
  d) If p1 >= Old.ItemCount then break out of loop, rest of New contains new items
  e) If p2 >= New.ItemCount then break out of loop, rest of Old items have been removed
  f) If p1 < Old.ItemCount and p2 < New.ItemCount Goto step **a**

That was just off the top of my head, but the basics should be correct. The key to this is, of course, that the lists are sorted.

Here is a quick and dirty demo. I included the sort for demonstration purposes; of course, in this case the data is already sorted.

static void Main(string[] args)
{
  string[] oldList = { "a", "b", "c", "d", "e" };
  string[] newList = { "d", "e", "f", "g" };      

  Array.Sort(oldList);
  Array.Sort(newList);

  int p1 = 0;
  int p2 = 0;

  while (p1 < oldList.Length && p2 < newList.Length)
  {
    if (string.Compare(oldList[p1], newList[p2]) == 0)
    {
      Console.WriteLine("Unchanged:\t{0}", oldList[p1]);
      p1++;
      p2++;
    }
    else if (string.Compare(oldList[p1], newList[p2]) < 0)
    {
      Console.WriteLine("Removed:\t{0}", oldList[p1]);
      p1++;
    }
    else if (string.Compare(oldList[p1], newList[p2]) > 0)
    {
      Console.WriteLine("Added:\t\t{0}", newList[p2]);
      p2++;
    }        
  }

  while (p1 < oldList.Length)
  {
    Console.WriteLine("Removed:\t{0}", oldList[p1]);
    p1++;
  }

  while (p2 < newList.Length)
  {
    Console.WriteLine("Added :\t\t{0}", newList[p2]);
    p2++;
  }

  Console.ReadKey();
}

The output from the above:

Removed:        a
Removed:        b
Removed:        c
Unchanged:      d
Unchanged:      e
Added :         f
Added :         g
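For reuse beyond strings, the same sorted merge could be wrapped in a generic method. This is a rough sketch rather than a tuned implementation: SortedMergeDiff and Compare are hypothetical names, the inputs are copied before sorting so the caller's collections stay untouched, and each input is assumed to contain distinct items.

```csharp
using System.Collections.Generic;

public static class SortedMergeDiff
{
    // Hypothetical generic version of the demo: sort copies of both inputs,
    // then walk them in step with two indexes.
    public static (List<T> Added, List<T> Removed, List<T> Unchanged) Compare<T>(
        IEnumerable<T> oldItems, IEnumerable<T> newItems, IComparer<T> comparer = null)
    {
        comparer = comparer ?? Comparer<T>.Default;

        var oldList = new List<T>(oldItems);
        var newList = new List<T>(newItems);
        oldList.Sort(comparer);
        newList.Sort(comparer);

        var added = new List<T>();
        var removed = new List<T>();
        var unchanged = new List<T>();

        int p1 = 0, p2 = 0;
        while (p1 < oldList.Count && p2 < newList.Count)
        {
            // Compare once per step and branch on the sign.
            int c = comparer.Compare(oldList[p1], newList[p2]);
            if (c == 0)
            {
                unchanged.Add(oldList[p1]);
                p1++;
                p2++;
            }
            else if (c < 0)
            {
                removed.Add(oldList[p1++]);
            }
            else
            {
                added.Add(newList[p2++]);
            }
        }

        // Whatever is left in either list has no counterpart in the other.
        while (p1 < oldList.Count) removed.Add(oldList[p1++]);
        while (p2 < newList.Count) added.Add(newList[p2++]);

        return (added, removed, unchanged);
    }
}
```

The two sorts dominate the cost at O(n log n), while the merge itself is linear, so this is mainly attractive when the inputs are already sorted or the items are awkward to hash.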

You might use the set operations available in LINQ.

string[] list1 = { "a","b","c","d","e"};
string[] list2 = { "d", "e", "f", "g" };

string[] newElements = list2.Except(list1).ToArray();
string[] commonElements = list2.Intersect(list1).ToArray();
string[] removedElements = list1.Except(list2).ToArray(); 

Note: The above code assumes that each of the lists is distinct, i.e. does not contain the same element more than once. For example, for the lists [a, b, c, c] and [a, b, c] the code won't detect the removed element.
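If duplicates do matter, one possible workaround (not part of LINQ) is to compare occurrence counts instead of set membership. The sketch below uses a hypothetical helper, MultisetDiff.CountChanges, and assumes the items are non-null and hashable.

```csharp
using System.Collections.Generic;

public static class MultisetDiff
{
    // Hypothetical helper: for each distinct item whose number of
    // occurrences differs, returns (count in new) - (count in old).
    // Positive means copies were added, negative means copies were removed.
    public static Dictionary<T, int> CountChanges<T>(
        IEnumerable<T> oldItems, IEnumerable<T> newItems)
    {
        var delta = new Dictionary<T, int>();

        foreach (T item in newItems)
        {
            delta.TryGetValue(item, out int n); // n is 0 when the key is absent
            delta[item] = n + 1;
        }

        foreach (T item in oldItems)
        {
            delta.TryGetValue(item, out int n);
            delta[item] = n - 1;
        }

        // Keep only the items whose counts actually changed.
        var changed = new Dictionary<T, int>();
        foreach (KeyValuePair<T, int> pair in delta)
        {
            if (pair.Value != 0)
                changed.Add(pair.Key, pair.Value);
        }
        return changed;
    }
}
```

For the lists [a, b, c, c] and [a, b, c] this reports a single entry, c with a change of -1, which is exactly the removal that Except misses.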

Call the sets X and Y. If set X supports rapid lookups, and you have a convenient means of "tagging" and "untagging" items in it, you could start by tagging all the items in X, and then query X for each item in Y. If an item isn't found, the item is "new" in Y. If the item is found, it's common to both sets and you should untag it in X. Repeat for all items in Y. When you're done, any items in X that are still tagged have been "deleted" from Y.

This approach only requires one of the sets to support convenient queries and tagging. It requires querying one set for all the records in the other, and then grabbing from it all items that haven't generated hits. There is no requirement to sort either set.
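One way to picture the tagging idea is with a Dictionary<T, bool> standing in for X, where the bool is the tag. This is only a sketch under that assumption: the names TaggingDiff and Compare are made up, and any other structure with fast lookup and a per-item flag would do just as well.

```csharp
using System.Collections.Generic;

public static class TaggingDiff
{
    // Hypothetical helper: X is held in a dictionary whose value is the tag
    // (true = still tagged, i.e. not yet seen in Y).
    public static (List<T> NewInY, List<T> Common, List<T> DeletedFromX) Compare<T>(
        IEnumerable<T> x, IEnumerable<T> y)
    {
        // Tag every item in X.
        var tagged = new Dictionary<T, bool>();
        foreach (T item in x)
            tagged[item] = true;

        var newInY = new List<T>();
        var common = new List<T>();

        foreach (T item in y)
        {
            if (tagged.TryGetValue(item, out bool isTagged))
            {
                // Found in X: untag it the first time it is seen.
                if (isTagged)
                {
                    tagged[item] = false;
                    common.Add(item);
                }
            }
            else
            {
                // Not in X at all, so it is new in Y.
                newInY.Add(item);
            }
        }

        // Anything still tagged never showed up in Y.
        var deleted = new List<T>();
        foreach (KeyValuePair<T, bool> pair in tagged)
        {
            if (pair.Value)
                deleted.Add(pair.Key);
        }

        return (newInY, common, deleted);
    }
}
```

Only X is indexed here. Y is read once, front to back, so it can be any enumeration, including one that cannot be rewound.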

I think what you are looking at is set operations, i.e. union etc. Take a look at this article: http://srtsolutions.com/public/item/251070
