
How to get two lists that have the most elements in common in a nested list in Python

I have a list of lists such as the following:

[["This", "is","a", "test"], ["test","something", "here"], ["cat", "dog", "fish"]]

How would I get the two lists that have the most words in common? In this case, it would be the first and second lists, because they both contain the word "test".

I have tried solving this by finding the intersection of every combination of two lists and keeping track of the combination with the highest number of words in common. However, this method seems inefficient with, say, 100,000 lists: that would be (100,000 choose 2) combinations, I think. Is there a faster way to do this?
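For a sense of scale, that pair count can be computed directly (a quick aside; math.comb is available from Python 3.8 onward):

from math import comb

# Number of unordered pairs among 100,000 lists:
print(comb(100_000, 2))  # 4999950000, i.e. roughly 5 billion intersections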

This is my code attempt:

from itertools import combinations

a = [["This", "is", "a", "test"], ["test", "something", "here"], ["cat", "dog", "fish"]]

max_common = 0  # renamed to avoid shadowing the built-in max()
mostsimilar = None
for pair in combinations(a, 2):
    intersection = set(pair[0]).intersection(pair[1])
    if len(intersection) > max_common:
        max_common = len(intersection)
        mostsimilar = pair

print(mostsimilar)

The output of my program is what I expected; however, it is very slow on bigger test cases.

Output:

(['This', 'is', 'a', 'test'], ['test', 'something', 'here'])

Based on my understanding of the problem, I think this should work:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KDTree, DistanceMetric

# Join each word list into a sentence so CountVectorizer can tokenise it.
data = np.array([' '.join(words) for words in [['This', 'is', 'a', 'test'], ['test', 'something', 'here'], ['cat', 'dog', 'fish']]])

# One row per sentence, one binary column per vocabulary word.
vectorised = CountVectorizer(binary=True).fit_transform(data).todense()
metric = DistanceMetric.get_metric('manhattan')
kdtree = KDTree(vectorised, metric=metric)
# k=2 because each point's nearest neighbour is itself; column 1 of the
# returned (distances, indices) arrays holds the nearest *other* point.
distances, indices = [result[:, 1] for result in kdtree.query(vectorised, k=2, dualtree=True)]

# Select the pair(s) at the overall minimum distance.
nearest_distances = (distances == distances.min())
print(data[nearest_distances])

Output:

['This is a test' 'test something here']

I recast the problem in the following manner:

Each list of words (or, sentence) can be represented as a row in a sparse matrix, where a 1 in a particular column denotes the presence of a word and a 0 its absence, using sklearn's CountVectorizer.

Then, we can see that the similarity of two sentences, as rows of the sparse matrix, can be determined by comparing the values of their elements in each column, which is simply the Manhattan distance. This means that we have a nearest-neighbour problem.

sklearn also provides a k-dimensional tree class, which we can use to find the nearest two neighbours of each point in the dataset (since a point's nearest neighbour is itself). Then it remains to find the neighbours with the lowest distances, which we can use to index the original array.
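To make the recasting concrete, here is a minimal sketch (an illustration, not part of the answer above) of the intermediate binary matrix for the three example sentences; note that CountVectorizer's default tokeniser drops single-character tokens such as "a":

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

sentences = ['This is a test', 'test something here', 'cat dog fish']
X = CountVectorizer(binary=True).fit_transform(sentences).toarray()

# Each row is a sentence, each column a vocabulary word. The Manhattan
# distance between two rows counts the words appearing in exactly one of
# the two sentences, so more shared words means a smaller distance.
print(np.abs(X[0] - X[1]).sum())  # rows 0 and 1 share "test": smaller distance
print(np.abs(X[0] - X[2]).sum())  # rows 0 and 2 share nothing: larger distance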

Using %%timeit, I tested the runtime of my solution against blhsing's solution on the text of this page, leaving the imports outside the timing loop:

# my solution
198 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  
# blhsing's solution
4.76 s ± 374 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  

Restricting the length of sentences to those under 20 words:

# my solution
3.2 ms ± 294 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# blhsing's solution
6.08 ms ± 714 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

A more efficient approach, with O(n × m) time complexity (where n is the size of the list and m is the average number of words per sub-list, so that if m is relatively small and constant compared to the size of the list, this solution scales linearly with n), is to iterate through the list of lists and build a seen dict that maps each word to a list of indices of the sub-lists found so far to contain that word, and then use collections.Counter over the words of the current list, with the most_common method, to find the index of the most similar list:

from collections import Counter
seen = {}  # maps each word to the indices of the lists seen so far that contain it
most_common_pair = None
most_common_count = 0
for index, lst in enumerate(a):
    # Count how often each earlier list's index co-occurs with the current
    # words; the top count is the earlier list sharing the most words.
    for common_index, common_count in Counter(
            i for word in lst for i in seen.get(word, ())).most_common(1):
        if common_count > most_common_count:
            most_common_count = common_count
            most_common_pair = a[common_index], lst
    for word in lst:
        seen.setdefault(word, []).append(index)

Given your sample input list in variable a, most_common_pair becomes:

(['This', 'is', 'a', 'test'], ['test', 'something', 'here'])

My attempt. I tested it with a list of 729 lists and it still runs fast. Honestly, I'm not sure how much faster it is, if at all. But it doesn't use sets.

Here it is (it has a test built in; just use the function):

a = [1, 2, 3, 4, 5, 6, 7, 8, 9]
b = [9, 10, 11, 12, 13, 14, 15]
c = [11, 3, 99]
d = [9, 10, 11, 12, 50, 1]
ls = [a, b, c, d]

# Pad the test data with many mostly-disjoint lists.
for i in range(100, 3000, 2):
    ls.append([i, i+1, i+2, i+3])


def f(ls):
    wordDic = {}   # word -> indices of the lists that contain it
    countDic = {}  # list index -> {other list index: number of shared words}
    ind = 0
    for t in range(len(ls)):
        countDic[t] = {}
    for l in ls:
        for word in l:
            try:
                for sharedIndex in wordDic[word]:  # every list already known to have this word
                    try:
                        countDic[sharedIndex][ind] += 1
                        try:
                            countDic[ind][sharedIndex] += 1
                        except KeyError:
                            countDic[ind][sharedIndex] = 1
                    except KeyError:
                        countDic[sharedIndex][ind] = 1
                wordDic[word].append(ind)
            except KeyError:
                wordDic[word] = [ind]
        ind += 1

    # Scan the pair counts for the maximum and return the corresponding lists.
    mx = 0
    ans = None
    ind = 0
    for sLs in countDic.values():
        for rLr in sLs:
            if mx < sLs[rLr]:
                ans = (ls[ind], ls[rLr])
            mx = max(mx, sLs[rLr])
        ind += 1
    return ans


print(f(ls))

What it does:

It's based on these two dictionaries: wordDic and countDic.

The keys of wordDic are each word that is used, and each value is a list of the indices of the lists where that word was found.

The keys of countDic are the indices of the lists, and its values are dictionaries that record how many elements each list shares with every other list:

countDic = { listInd: {otherListInd: sharedAmount, ...}, ... }

First it creates the dictionaries. Then it goes through each list once and records the words the list contains: it adds its own index to each word's list of indices, after incrementing the "shared words" counts in the second dictionary for every list already known to contain that word.

When it finishes, you'd have something like this:

{0: {1: 1, 2: 1, 3: 2}, 1: {2: 1, 3: 4}, 2: {3: 1}, 3: {1: 3, 0: 1}}
([9, 10, 11, 12, 13, 14, 15], [9, 10, 11, 12, 50, 1])

which reads as:

{(List zero: elements shared with list 1 = 1, elements shared with list 2 = 1, elements shared with list 3 = 2), (List one: elements shared with list 2 = 1, elements shared with list 3 = 4), ...}

In this case, list 1 shares more elements with list 3 than any other pair does. The rest of the function simply goes through the dictionary and finds this maximum.

I probably messed up my explanation. I think it would be best for you to first check whether the function works better than your own, and then try to understand it.

I also just noticed that you probably only need to add 1 to the previously found lists, and don't need to add the symmetric count to the list you're currently testing. I'll see if that works.

EDIT 1: It seems so. The lines:

try:
    countDic[ind][sharedIndex] += 1
except KeyError:
    countDic[ind][sharedIndex] = 1

can be commented out.
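For illustration, here is a minimal sketch (not from the answer itself) of the inner update with that symmetric bookkeeping removed and the try/except blocks replaced by dict.get and setdefault; variable names are as in the function above:

for word in l:
    # Every list already known to contain this word now shares one more element.
    for sharedIndex in wordDic.get(word, []):
        countDic[sharedIndex][ind] = countDic[sharedIndex].get(ind, 0) + 1
    wordDic.setdefault(word, []).append(ind)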

I hope this helps.
