How to find the least amount of common sets

Given the sets

{1,2,3,4} {2,3,4} {1,4} {1}

What is an easy (and preferably performant) algorithm to find the groups: {1} {2,3} {4}?

This is the shortest list of sets where:

  • all members (1-4) are represented.
  • 2 and 3 are grouped together because they always appear together in the original sets.

The real data is a bunch of references, not value types.

EDIT: I don't think summarizing what I've tried does anything to help the question, and it only serves as a distraction, as there probably is an algorithm in category theory for this, but (for entertainment reasons) here goes:

  • I've aggregated over hash sets, trying to use the union operator.
  • I've performed a GroupBy on an Aggregate over GetHashCode.
  • I've iterated over the list using the first entry as a candidate set, seeking to gradually reduce it when comparing against other members. This did not perform well and I'm not sure it ended up with the fewest sets possible.

First off, let's carefully characterize your problem.

A relation is a function that takes two arguments and returns a bool that indicates whether the relation holds or not. For example, "less than" is a relation.

An equivalence relation is a relation that is reflexive (every item is related to itself), symmetric (if A is related to B then B is related to A), and transitive (if A is related to B and B is related to C, then A is related to C).

An equivalence relation forms an equivalence partition of a set; that is, a number of disjoint subsets where the elements within each subset are all related to one another. Each subset is called an equivalence class. For example, the equivalence relation on integers "A and B are related if their difference is divisible by 3" forms three equivalence classes:

{0, 3, -3, 6, -6, ... }
{1, 4, -2, 7, -5, ... }
{2, 5, -1, 8, -4, ... }
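As a small illustration of that example, here is a minimal C# sketch that groups a finite sample of the integers into those three classes. The range -6..8 and the helper name `Mod3Class` are assumptions made only for this illustration.

```csharp
using System;
using System.Linq;

// Hypothetical helper: maps an integer to its class representative 0, 1 or 2.
// C#'s % keeps the sign of the dividend, so negative remainders are normalized.
int Mod3Class(int n) => ((n % 3) + 3) % 3;

// Partition a finite sample of the integers (-6..8) into equivalence classes.
var classes = Enumerable.Range(-6, 15)
    .GroupBy(Mod3Class)
    .ToDictionary(g => g.Key, g => g.OrderBy(n => n).ToArray());

foreach (var kv in classes.OrderBy(kv => kv.Key))
    Console.WriteLine(kv.Key + ": {" + string.Join(", ", kv.Value) + "}");
```

Two integers land in the same bucket exactly when their difference is divisible by 3, which is the relation described above.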

You wish to form the union of all your sets:

{1, 2, 3, 4} U {2, 3, 4} U {1, 4} U {1} --> {1, 2, 3, 4}

And then partition that set into equivalence classes, where the equivalence relation is "A and B are related if and only if A and B always appear together in the original sets"; that is, every original set contains either both of them or neither of them.
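To make the union and the relation concrete, here is a minimal sketch using the sets from the question. The helper name `AlwaysTogether` is hypothetical, and plain ints stand in for the references mentioned in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var sets = new List<HashSet<int>>
{
    new HashSet<int> { 1, 2, 3, 4 },
    new HashSet<int> { 2, 3, 4 },
    new HashSet<int> { 1, 4 },
    new HashSet<int> { 1 },
};

// Union of all the sets: {1, 2, 3, 4}.
var union = new HashSet<int>(sets.SelectMany(s => s));

// Hypothetical helper: a and b are related when every original set
// contains either both of them or neither of them.
bool AlwaysTogether(int a, int b) =>
    sets.All(s => s.Contains(a) == s.Contains(b));

Console.WriteLine(AlwaysTogether(2, 3)); // True
Console.WriteLine(AlwaysTogether(1, 4)); // False: {1} contains 1 but not 4
```

Because the check compares set membership for equality, it is reflexive, symmetric and transitive, so it qualifies as an equivalence relation.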

Start by forming a dictionary that maps each element to its associated equivalence class. As you correctly point out, the worst case is that we have the equivalence partitioning where every equivalence class contains only one element, so let's start with that. (This is the equivalence partitioning for the "A equals B" equivalence relation, incidentally.)

1 --> { 1 }
2 --> { 2 }
3 --> { 3 }
4 --> { 4 }

Now produce the set of all unordered pairs from the union:

{ {1, 2}, {1, 3}, {1, 4}, {2, 3}, {2, 4}, {3, 4} }

Now for each of those unordered pairs, ask the question "does the relation hold for this pair?"

For {1, 2}, {1, 3}, {1, 4}, the relation does not hold.

For {2, 3} the relation does hold, so merge the 2 and 3 buckets together:

1 -->     { 1 }
2 --\ 
     -->  { 2, 3 }
3 --/
4 -->     { 4 }

For {2, 4} and {3, 4} the relation does not hold.

Now you're done, and you have a map from every element to its corresponding equivalence class.
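The walkthrough above could be sketched roughly as follows. This is one possible reading of the steps, not a definitive implementation: the name `Partition` is hypothetical, the relation is passed in as a delegate, and the code assumes the delegate really is an equivalence relation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: maps every item to its equivalence class by testing
// each unordered pair once. Performs O(n^2) relation checks.
Dictionary<T, HashSet<T>> Partition<T>(IEnumerable<T> items, Func<T, T, bool> related)
    where T : notnull
{
    var list = items.Distinct().ToList();
    // Start with every element alone in its own class.
    var classOf = list.ToDictionary(x => x, x => new HashSet<T> { x });

    for (int i = 0; i < list.Count; i++)
    for (int j = i + 1; j < list.Count; j++)
    {
        var a = classOf[list[i]];
        var b = classOf[list[j]];
        // Skip pairs already in the same bucket, and pairs that are unrelated.
        if (ReferenceEquals(a, b) || !related(list[i], list[j])) continue;

        // Merge bucket b into bucket a and repoint b's members at a.
        foreach (var x in b) { a.Add(x); classOf[x] = a; }
    }
    return classOf;
}

var sets = new[]
{
    new HashSet<int> { 1, 2, 3, 4 }, new HashSet<int> { 2, 3, 4 },
    new HashSet<int> { 1, 4 }, new HashSet<int> { 1 },
};
var union = sets.SelectMany(s => s);
var classes = Partition(union, (a, b) => sets.All(s => s.Contains(a) == s.Contains(b)));

// Each class object is shared by all of its members, so Distinct() on the
// values (reference equality) yields one entry per class.
foreach (var cls in classes.Values.Distinct())
    Console.WriteLine("{" + string.Join(", ", cls) + "}");
```

For the sets in the question this yields the three classes {1}, {2, 3} and {4}; the order in which they are enumerated is not guaranteed. Keeping the relation as a parameter means the same helper can be reused with any other equivalence relation.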

Make sense?

There are a number of ways you can optimize this algorithm once you've got it correct. Get it correct first.

Notice what I did here: I solved your specific problem by solving the general problem of equivalence partitioning. If you're clever about how you write this, you'll be able to re-use the logic to solve any equivalence partitioning problem, not just your specific problem.

Here is one approach that arrives at the same answer you did:

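using System.Linq;
var sets = new[] { new[] {1,2,3,4}, new[] {2,3,4}, new[] {1,4}, new[] {1} };
var results = sets.SelectMany((x, i) => x.Select(y => new { y, i }))   // (value, set index) pairs
    // Signature per value: the indices of the sets containing it. The "," separator
    // keeps indices >= 10 unambiguous (with "", indices 1,10 and 11,0 both give "110").
    .GroupBy(x => x.y).Select(x => new { x.Key, g = string.Join(",", x.Select(y => y.i)) })
    // Values with identical signatures belong together.
    .GroupBy(x => x.g).Select(x => x.Select(y => y.Key).ToArray()).ToArray();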

I would probably define the result of this query to be the shortest list of smallest sets that can be used to compose the original sets. It uses the indices of the sets in which each value appears as a means of grouping the values. (1 appears in sets 0, 2, 3; 4 appears in sets 0, 1, 2; etc.) 2 and 3 have the same index lists, so they are grouped together in the final result.

My first approach would not work correctly for the sets {1,2,3,4}, {2,3,4}, {1,4} (the answer should be {1}, {4}, {2,3}). This one will.

Though Eric Lippert correctly described the solution, I didn't see how to create good parallel code for it. I therefore had to use this approach. My solution is as follows:

{1,2,3,4} {2,3,4} {1,4} {1}

Let's call the references to these lists A, B, C and D, respectively.

A: {1,2,3,4}
B: {2,3,4}
C: {1,4}
D: {1}

I performed SelectMany, associating each member with the reference to the list where it came from.

A, 1
A, 2
A, 3
A, 4
B, 2
B, 3
B, 4
C, 1
C, 4
D, 1

Then I grouped them by the member.

1 : {A,C,D}
2 : {A,B}
3 : {A,B}
4 : {A,B,C}

(Here we see that 2 and 3 have identical lists, which is expected since they appear in the same original lists.) This is also the key point.
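The two steps so far (SelectMany, then grouping by member) might look roughly like this. It is a sketch under assumptions: int arrays stand in for the real reference data, the variable names are made up for the illustration, and no list is assumed to contain the same member twice.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var A = new[] { 1, 2, 3, 4 };
var B = new[] { 2, 3, 4 };
var C = new[] { 1, 4 };
var D = new[] { 1 };
var lists = new[] { A, B, C, D };

// Only used for printing; arrays are compared by reference as dictionary keys.
var names = new Dictionary<int[], string> { [A] = "A", [B] = "B", [C] = "C", [D] = "D" };

// Pair each member with the list it came from, then group by member.
var sourcesByMember = lists
    .SelectMany(list => list.Select(member => (list, member)))
    .GroupBy(p => p.member, p => p.list)
    .ToDictionary(g => g.Key, g => g.ToList());

foreach (var kv in sourcesByMember)
    Console.WriteLine(kv.Key + " : {" + string.Join(",", kv.Value.Select(l => names[l])) + "}");
```

Members 2 and 3 end up with the same source lists (A and B), which is what the final grouping step relies on.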

In order to find lists with the same members, I did an Aggregate() by XOR-ing the result of GetHashCode() over the list items. So for "1", I effectively did

var someInt = A.GetHashCode() ^ C.GetHashCode() ^ D.GetHashCode();

Thus producing an int for each member.

1: SomeIntA
2: SomeIntB
3: SomeIntB
4: SomeIntC.

By grouping on this, I finally got the desired result: {1},{2,3},{4}
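One caveat with grouping on an XOR of hash codes: two different collections of lists can, in principle, produce the same int, and the corresponding members would then be merged by mistake. A possible alternative, sketched here and not taken from the answer above, is to group on the sets of source lists themselves using `HashSet<T>.CreateSetComparer()`. Int arrays again stand in for the real reference data.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var A = new[] { 1, 2, 3, 4 };
var B = new[] { 2, 3, 4 };
var C = new[] { 1, 4 };
var D = new[] { 1 };
var lists = new[] { A, B, C, D };

var groups = lists
    .SelectMany(list => list.Select(member => (list, member)))
    // member -> set of the lists (compared by reference) that contain it
    .GroupBy(p => p.member, p => p.list)
    .Select(g => (member: g.Key, sources: new HashSet<int[]>(g)))
    // Members whose source sets are equal as sets end up in the same group.
    .GroupBy(x => x.sources, x => x.member, HashSet<int[]>.CreateSetComparer())
    .Select(g => g.ToArray())
    .ToArray();

foreach (var grp in groups)
    Console.WriteLine("{" + string.Join(",", grp) + "}");
// Prints {1}, {2,3} and {4}, one per line.
```

The set comparer still hashes for bucketing, but candidates are then confirmed with a real set-equality check, so a hash collision alone cannot merge two groups.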