
Is there a faster way to find co-occurrence of two elements in a list of lists?

I have a list like this:

a = [
    ['a1', 'b2', 'c3'],
    ['c3', 'd4', 'a1'],
    ['b2', 'a1', 'e5'],
    ['d4', 'a1', 'b2'],
    ['c3', 'b2', 'a1']
    ]

I'll be given x (e.g. 'a1'). I have to find the co-occurrence of a1 with every other element, sort the results, and retrieve the top n (e.g. top 2). My answer should be:

[
 {'product_id': 'b2', 'count': 4}, 
 {'product_id': 'c3', 'count': 3}, 
]

My current code looks like this:

import itertools

def compute(x):
    set_a = list(set(itertools.chain(*a)))
    count_dict = []
    for item in set_a:
        if item == x:
            continue
        count = 0
        for row in a:
            # count rows in which x and item co-occur
            if x in row and item in row:
                count += 1
        if count > 0:
            count_dict.append({'product_id': item, 'count': count})
    count_dict = sorted(count_dict, key=lambda k: k['count'], reverse=True)[:2]
    return count_dict

And it works beautifully for smaller inputs. However, my actual input has 70,000 unique items instead of 5 (a1 to e5), and 1.3 million rows instead of 5, so the m×n scan becomes prohibitively expensive. Is there a faster way to do this?

"Faster" is a very general term. Do you need a shorter total processing time, or a shorter response time per request? Is this for only one request, or do you want a system that handles repeated inputs?

If what you need is the fastest response time for repeated inputs, then convert this entire list of lists into a graph, with each element as a node and each edge weight being the number of co-occurrences of the two elements it connects. You make a single pass over the data to build the graph. For each node, sort the edge list by weight. From there, each request is a simple lookup: return the node's top-weighted edges, which costs one hash (a linear function) and two direct-access operations (base address + offset).


UPDATE after OP's response

"Fastest response" settles the algorithm, then. What you want is a simple dict, keyed by each node. The value for each node is a sorted list of related elements and their counts.

A graph package (say, networkx) will give you a good entry point, but may not keep a node's edges in a fast-access form, nor sorted by weight. Instead, pre-process your data. Each row gives you a list of related elements. Let's look at the processing for some row in the middle of the data set; call the elements `a5, b2, z1` and the dict `d`. Assume that `a5` and `b2` are already in your dict.

Using `itertools`, iterate through the six ordered pairs:
(a5, b2):
    d[a5][b2] += 1
(a5, z1):
    d[a5][z1]  = 1  (creates a new entry under a5)
(b2, a5):
    d[b2][a5] += 1
(b2, z1):
    d[b2][z1]  = 1  (creates a new entry under b2)
(z1, a5):
    d[z1] = {}      (creates a new z1 entry in d)
    d[z1][a5]  = 1  (creates a new entry under z1)
(z1, b2):
    d[z1][b2]  = 1  (creates a new entry under z1)

You'll want to use `defaultdict` to save yourself the hassle of detecting and initializing new entries.

With all of that handled, you now sort each of those sub-dicts by their values. This leaves you with an ordered sequence for each element. When you need the top n connected elements, you go straight to the dict and extract them:

top = d[elem][:n]

Can you finish the coding from there?你能从那里完成编码吗?
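For completeness, here is a minimal sketch of that build on the sample list `a` from the question, using `itertools.permutations` to generate both directions of each pair (the names `d` and `top` follow the description above):

```python
import itertools
from collections import defaultdict

a = [
    ['a1', 'b2', 'c3'],
    ['c3', 'd4', 'a1'],
    ['b2', 'a1', 'e5'],
    ['d4', 'a1', 'b2'],
    ['c3', 'b2', 'a1'],
]

# One pass over the rows: for every ordered pair in a row,
# bump the co-occurrence count.
d = defaultdict(lambda: defaultdict(int))
for row in a:
    for x, y in itertools.permutations(row, 2):
        d[x][y] += 1

# Sort each node's neighbours by count, descending, once up front.
top = {node: sorted(nbrs.items(), key=lambda kv: kv[1], reverse=True)
       for node, nbrs in d.items()}

print(top['a1'][:2])  # → [('b2', 4), ('c3', 3)]
```

Each request is then just the slice `top[x][:n]`.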

As @Prune mentioned, you haven't said whether you want a shorter processing time or a shorter response time, so I will explain two approaches to this problem.

  1. The optimized-code approach (for less processing time)
from heapq import nlargest
from operator import itemgetter
import itertools

# say we have K threads
def compute(x, top_n=2):
    # first, find the unique items and keep them somewhere easily accessible
    set_a = list(set(itertools.chain(*a)))

    # find which of your rows x exists in
    selected_rows = []
    for i, row in enumerate(a):  # this whole loop can be parallelized
        if x in row:
            selected_rows.append(i)  # append the row's index

    # Time complexity so far is still O(M*N), but each row can be
    # evaluated independently, so with K threads this scan
    # takes O((M/K)*N).

    count_dict = []

    # the same thing you did earlier, but the second loop now looks at fewer rows
    for val in set_a:
        if val == x:
            continue
        count = 0
        for ri in selected_rows:  # this part can be parallelized as well
            if val in a[ri]:
                count += 1
        count_dict.append({'product_id': val, 'count': count})

    # If the selected-rows size is M in the worst case, and there are
    # U unique values, this part costs O((U/K)*(M/K)*N).

    res = nlargest(top_n, count_dict, key=itemgetter('count'))
    return res

Let's calculate the time complexity here. If we have K threads, then it is:

O((M/K)*N) + O((U/K)*(M/K)*N)

where

M---> Total rows
N---> Total Columns
U---> Unique Values
K---> number of threads
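The "this loop can be parallelized" step above could look something like the following sketch with `concurrent.futures` (`find_rows_parallel` and the chunking scheme are illustrative, not from the original answer; note that in CPython, threads share the GIL, so a `ProcessPoolExecutor` would be needed for a real CPU-bound speedup):

```python
from concurrent.futures import ThreadPoolExecutor

a = [
    ['a1', 'b2', 'c3'],
    ['c3', 'd4', 'a1'],
    ['b2', 'a1', 'e5'],
    ['d4', 'a1', 'b2'],
    ['c3', 'b2', 'a1'],
]

def rows_containing(args):
    """Scan one chunk of (index, row) pairs for x."""
    chunk, x = args
    return [i for i, row in chunk if x in row]

def find_rows_parallel(x, data, workers=4):
    # split the indexed rows into roughly equal chunks, one per worker
    indexed = list(enumerate(data))
    size = max(1, len(indexed) // workers)
    chunks = [indexed[i:i + size] for i in range(0, len(indexed), size)]
    selected = []
    # ThreadPoolExecutor is used here only to illustrate the structure;
    # swap in ProcessPoolExecutor for CPU-bound work in CPython.
    with ThreadPoolExecutor(max_workers=workers) as ex:
        for part in ex.map(rows_containing, ((c, x) for c in chunks)):
            selected.extend(part)
    return sorted(selected)

print(find_rows_parallel('a1', a))  # every row of the sample contains 'a1'
```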
  2. The graph approach, as suggested by @Prune
# other approach, adding to Prune's suggestion
import itertools

big_dictionary = {}
set_a = list(set(itertools.chain(*a)))
for x in set_a:
    big_dictionary[x] = []
    for y in set_a:
        if x == y:
            continue
        count = 0
        for arr in a:
            if (x in arr) and (y in arr):
                count += 1
        big_dictionary[x].append((y, count))

for x in big_dictionary:
    big_dictionary[x]=sorted(big_dictionary[x], key=lambda v:v[1], reverse=True)

Let's calculate the time complexity for this one. The one-time build cost will be:

O(U*U*M*N)

where

M---> Total rows
N---> Total Columns
U---> Unique Values

But once this big_dictionary has been calculated, it takes just one step to get your top-n values. For example, to get the top 3 values for a1:

result=big_dictionary['a1'][:3]
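The same table can also be built in a single pass over the rows instead of the U×U double loop, which drops the build cost to O(M·N²). A sketch on the question's sample list `a`, using `itertools.combinations` (the names `counts` and `big_dictionary` are this sketch's own; the per-item results match the double-loop build):

```python
import itertools
from collections import defaultdict

a = [
    ['a1', 'b2', 'c3'],
    ['c3', 'd4', 'a1'],
    ['b2', 'a1', 'e5'],
    ['d4', 'a1', 'b2'],
    ['c3', 'b2', 'a1'],
]

counts = defaultdict(lambda: defaultdict(int))
for row in a:
    # each unordered pair in a row counts once in both directions
    for x, y in itertools.combinations(row, 2):
        counts[x][y] += 1
        counts[y][x] += 1

# same shape as big_dictionary: item -> list of (item, count), sorted descending
big_dictionary = {x: sorted(nbrs.items(), key=lambda v: v[1], reverse=True)
                  for x, nbrs in counts.items()}

print(big_dictionary['a1'][:3])  # → [('b2', 4), ('c3', 3), ('d4', 2)]
```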

I followed the `defaultdict` approach suggested by @Prune above. Here's the final code:

from collections import defaultdict
import numpy as np

def recommender(input_item, b_list, n):
    count = []
    top_items = []
    # co-occurrence count = number of rows the two items share
    for x in b.keys():
        lst_2 = b[x]
        common_transactions = len(set(b_list) & set(lst_2))
        count.append(common_transactions)

    # take the indices of the n+1 largest counts (descending),
    # then drop the first one, which is input_item itself
    top_ids = list((np.argsort(count)[:-n - 2:-1])[1::])
    top_values_counts = [count[i] for i in top_ids]

    key_list = list(b.keys())
    for i, v in enumerate(top_ids):
        item_id = key_list[v]
        top_items.append({item_id: top_values_counts[i]})
    print(top_items)
    return top_items

a = [
        ['a1', 'b2', 'c3'],
        ['c3', 'd4', 'a1'],
        ['b2', 'a1', 'e5'],
        ['d4', 'a1', 'b2'],
        ['c3', 'b2', 'a1']
    ]

b = defaultdict(list) 
for i, s in enumerate(a):
    for key in s : 
        b[key].append(i)
        
input_item = str(input("Enter the item_id: "))
n = int(input("How many values to be retrieved? (eg: top 5, top 2, etc.): "))
top_items = recommender(input_item, b[input_item], n)

Here's the output for the top 3 for 'a1':

[{'b2': 4}, {'c3': 3}, {'d4': 2}]

Thanks!!!
