
Python - Find the biggest subset of a list of lists where no inner item is repeated

I have a list of lists where each sublist consists of four items, in this format:

ll = [["dog", "cat", "mouse", "pig"],
      ["pidgeon", "goose", "rat", "frog"],
      ["bird", "dog", "mouse", "pig"],
      ["wolf", "cat", "whale", "rhino"],
      ...
      ["chameleon", "bat", "zebra", "lion"]]

I need to find the largest combination of the inner lists in which no string is repeated. The output should be in the same format as ll, that is, a list of lists where each sublist consists of four strings. So my desired output would exclude ["dog", "cat", "mouse", "pig"] (the first sublist), since it shares the items "dog", "mouse" and "pig" with ["bird", "dog", "mouse", "pig"] (the third sublist) and the item "cat" with ["wolf", "cat", "whale", "rhino"] (the fourth sublist). Crucially, my desired output would not instead exclude the third and the fourth sublists while keeping the first: that would also be a combination with no repeated string, but it would not be the largest one.

So far I have tried two options, each of which is unsatisfactory in a different way:

Option 1

import itertools

output = []
for comb in itertools.combinations(ll, 40):
    merged = set(itertools.chain.from_iterable(comb))  # flatten the nested lists into a set
    if len(merged) == 160:  # 40 * 4 = 160 --> no item is repeated
        output.append(comb)

The downsides of this option are that (a) it's not computationally efficient at all, and (b) I would be specifying a priori the number of inner lists that I aim for, instead of maximizing it.

Option 2

items = set()
unique = []
for quartet in ll:
    if set(quartet).isdisjoint(items):
        unique.append(quartet)
        for word in quartet:
            items.add(word)
print(unique)

The downsides of this option are that, although it returns a list that meets my constraint (no repetition), it does not necessarily return the largest one, and the output depends on the order of ll.
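To make the order sensitivity concrete, here is a minimal sketch that wraps Option 2 in a function and runs it on three of the lists above in two different orders. The name `first_fit` and the variables `a`, `b`, `c` are introduced only for this illustration.

```python
def first_fit(lists):
    """Option 2 as a function: keep each list that shares nothing with those kept so far."""
    items = set()
    unique = []
    for quartet in lists:
        if set(quartet).isdisjoint(items):
            unique.append(quartet)
            items.update(quartet)
    return unique


a = ["dog", "cat", "mouse", "pig"]
b = ["bird", "dog", "mouse", "pig"]
c = ["wolf", "cat", "whale", "rhino"]

print(len(first_fit([a, b, c])))  # 1: a is kept first and blocks both b and c
print(len(first_fit([b, c, a])))  # 2: b and c are kept, a is rejected
```

The same three lists give a result of size 1 or 2 depending only on their order.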

You can use your second method with a little preprocessing and a greedy approach.

  • First, traverse all elements in ll and store every unique element and its count in a dict:
{
  "dog": 2,
  "cat": 2,
  ...
}
  • Then, for every list in ll, count how many of its elements overlap with other lists (an element overlaps if its value in the dict is greater than 1) and store that count.
  • Now sort ll by overlap count in ascending order using the sorted() function, so that the lists with the fewest shared elements come first.
  • Finally, run your second method on the sorted ll.
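The steps above can be sketched as follows. This is a minimal sketch, not a definitive implementation: the function name `greedy_disjoint` and the four-list `sample` are made up for illustration, the sort is assumed to be ascending, and each inner list is assumed to contain distinct strings. Because the approach is a greedy heuristic, the result is not guaranteed to be the largest possible subset for every input.

```python
from collections import Counter


def greedy_disjoint(lists):
    """Heuristic: try the least-overlapping lists first and keep the ones that fit."""
    # Step 1: count how often each string occurs across all inner lists.
    counts = Counter(word for quartet in lists for word in quartet)

    # Step 2: overlap count = number of strings in a list that occur more than once overall.
    def overlap(quartet):
        return sum(1 for word in quartet if counts[word] > 1)

    # Step 3: sort ascending by overlap count (sorted() is stable, so ties keep input order).
    ordered = sorted(lists, key=overlap)

    # Step 4: run the second method (first fit) on the sorted input.
    seen = set()
    result = []
    for quartet in ordered:
        if seen.isdisjoint(quartet):
            result.append(quartet)
            seen.update(quartet)
    return result


# Hypothetical sample: only the four lists spelled out in the question.
sample = [
    ["dog", "cat", "mouse", "pig"],
    ["pidgeon", "goose", "rat", "frog"],
    ["bird", "dog", "mouse", "pig"],
    ["wolf", "cat", "whale", "rhino"],
]
print(greedy_disjoint(sample))
```

On this sample the first list has the highest overlap count (4), so it is tried last and rejected, while the other three lists are kept.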
