
Python - Find the biggest subset of a list of lists where no inner item is repeated

I have a list of lists where each sublist is composed of four items, in this format:

ll = [["dog", "cat", "mouse", "pig"],
      ["pidgeon", "goose", "rat", "frog"],
      ["bird", "dog", "mouse", "pig"],
      ["wolf", "cat", "whale", "rhino"],
      # ...
      ["chameleon", "bat", "zebra", "lion"]]

I need to find the biggest combination of the inner lists in which no string is ever repeated. My output should be in the same format as ll: a list of lists where each sublist is composed of four strings. So my desired output would exclude ["dog", "cat", "mouse", "pig"] (the first sublist), since it shares the items "dog", "mouse" and "pig" with ["bird", "dog", "mouse", "pig"] (the third sublist) and the item "cat" with ["wolf", "cat", "whale", "rhino"] (the fourth sublist). Crucially, my desired output would not instead exclude the third and the fourth sublists: keeping the first sublist and dropping those two would also give a combination with no repeated string, but it would not be the biggest one.

So far I have tried two options, each of which is undesirable in a different way:

Option 1

import itertools

output = []
for comb in itertools.combinations(ll, 40):
    merged = set(itertools.chain.from_iterable(comb))  # flatten the chosen sublists
    if len(merged) == 160:  # 40 * 4 = 160 --> no item is repeated
        output.append(comb)

The downsides of this option are that (a) it is not computationally efficient at all, and (b) I have to specify a priori the number of inner lists I am aiming for, instead of maximizing it.

Option 2

items = set()
unique = []
for quartet in ll:
    if set(quartet).isdisjoint(items):
        unique.append(quartet)
        for word in quartet:
            items.add(word)
print(unique)

The downside of this option is that, although it returns a list that meets my constraint (no repetition), it does not necessarily return the biggest one, and the output is sensitive to the order of ll.
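To illustrate the order sensitivity described above, here is a small sketch that wraps the Option 2 loop in a helper and runs it on the four sublists shown in the question, first in the given order and then reversed. The helper name greedy_pass and the variable names sample, forward and backward are assumptions made for this illustration; they are not from the original post.

```python
def greedy_pass(sublists):
    """Option 2 loop: keep each sublist that shares no word with those already kept."""
    items = set()
    unique = []
    for quartet in sublists:
        if set(quartet).isdisjoint(items):
            unique.append(quartet)
            items.update(quartet)
    return unique


# Only the four sublists spelled out in the question (the elided ones are omitted).
sample = [["dog", "cat", "mouse", "pig"],
          ["pidgeon", "goose", "rat", "frog"],
          ["bird", "dog", "mouse", "pig"],
          ["wolf", "cat", "whale", "rhino"]]

forward = greedy_pass(sample)         # original order
backward = greedy_pass(sample[::-1])  # reversed order

print(len(forward), len(backward))  # prints: 2 3
```

In the original order the first sublist is kept, which blocks both the third and the fourth; in reversed order those two are kept and only the first is dropped.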

You can use your second method with a little preprocessing and a greedy approach.

  • First, traverse all elements in ll and store each unique element and its count in a dict.
{
  "dog": 2,
  "cat": 2,
  ...
}
  • Then, for every list in ll, find out how many of its elements overlap with other lists (an element overlaps if its value in the dict is greater than 1) and store that count.
  • Now sort ll by overlap count using the sorted() function.
  • Finally, run your second method on the sorted ll.
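The steps above can be sketched roughly as follows. This is one reading of the answer, not a definitive implementation: the function name greedy_disjoint is hypothetical, the sort is assumed to be ascending (fewest overlaps first), and each sublist is assumed to contain four distinct strings. Being greedy, it is a heuristic and is not guaranteed to return the largest possible combination.

```python
from collections import Counter


def greedy_disjoint(ll):
    # Step 1: count how often each word occurs across all sublists.
    counts = Counter(word for quartet in ll for word in quartet)

    # Step 2: a sublist's overlap count is the number of its words
    # that also appear in some other sublist (count greater than 1).
    def overlap(quartet):
        return sum(1 for word in quartet if counts[word] > 1)

    # Step 3: sort so that sublists with the fewest overlaps come first
    # (ascending order is an assumption; the answer does not state it).
    ordered = sorted(ll, key=overlap)

    # Step 4: run the Option 2 pass on the sorted list.
    items = set()
    unique = []
    for quartet in ordered:
        if items.isdisjoint(quartet):
            unique.append(quartet)
            items.update(quartet)
    return unique


# Only the four sublists spelled out in the question.
ll = [["dog", "cat", "mouse", "pig"],
      ["pidgeon", "goose", "rat", "frog"],
      ["bird", "dog", "mouse", "pig"],
      ["wolf", "cat", "whale", "rhino"]]

result = greedy_disjoint(ll)
print(result)
```

On these four sublists the overlap counts are 4, 0, 3 and 1, so the first sublist is tried last and is the only one rejected, which matches the desired output described in the question.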
