
Given a list of lists of strings, find most frequent pair of strings, second most frequent pair, ....., then most frequent triplet of strings, etc

I have a list that contains k lists of strings (none of these k lists contains a duplicate string). We know the union of all possible strings (suppose we have n unique strings).

What we need to find is: what is the most frequent pair of strings (i.e., which 2 strings appear together the most across the k lists)? And the second most frequent pair, the third most frequent pair, etc.? I'd also like to know the most frequent triplet of strings, the second most frequent triplet, etc.

The only algorithm I could think of has terrible complexity: to find the most frequent pair, I'd enumerate all possible pairs out of the n strings (O(n^2)), check for each pair how many lists contain it (O(k)), and then sort the results to get what I need. The overall complexity is therefore O(n^2 · k), ignoring the final sort.
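For concreteness, the brute-force idea described above might look roughly like the sketch below. The function name `brute_force_pairs` and the `universe` argument (the set of all n strings) are made up for illustration and are not part of the original question.

```python
from itertools import combinations

def brute_force_pairs(lists, universe):
    """Count, for every pair drawn from `universe`, how many lists contain both strings."""
    sets = [set(lst) for lst in lists]  # sets give O(1) membership checks
    counts = {}
    for a, b in combinations(sorted(universe), 2):  # O(n^2) candidate pairs
        # O(k) scan over the lists for each candidate pair
        counts[(a, b)] = sum(1 for s in sets if a in s and b in s)
    # Rank from most to least frequent, dropping pairs that never co-occur
    return sorted(((p, c) for p, c in counts.items() if c > 0),
                  key=lambda item: item[1], reverse=True)

myList = [['AB', 'AC', 'ACC'], ['AB', 'ACC'], ['ACC'], ['AC', 'ACC'], ['ACC', 'BB', 'AC']]
universe = {s for lst in myList for s in lst}
ranked = brute_force_pairs(myList, universe)
print(ranked[:2])
```

Most of the O(n^2) candidate pairs may never occur together in any list, which is where the wasted work comes from.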

Any ideas for a better algorithm time-wise (one that would hopefully also work well for triplets of strings, quadruplets of strings, etc.)? Code in Python is best, but detailed pseudocode (and data structure, if relevant) or a detailed general idea is fine, too!

For example, if

myList=[['AB', 'AC', 'ACC'], ['AB','ACC'],['ACC'],['AC','ACC'],['ACC','BB','AC']], 

then the expected output of the pairs question would be: 'AC','ACC' is the most frequent pair and 'AB','ACC' is the second most frequent pair.

You can use combinations, Counter and frozenset:

from itertools import combinations
from collections import Counter

combos = (combinations(i, r=2) for i in myList)
Counter(frozenset(i) for c in combos for i in c).most_common(2)

Output:

[(frozenset({'AC', 'ACC'}), 3), (frozenset({'AB', 'ACC'}), 2)]
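The same idea should extend to triplets, quadruplets, and so on by making the group size a parameter. The following is only a sketch: the helper name `top_groups` and its `top` parameter are hypothetical, and it relies on the question's guarantee that no inner list contains duplicates.

```python
from collections import Counter
from itertools import combinations

def top_groups(lists, r, top=None):
    """Return (frozenset, count) tuples for groups of r strings, most frequent first."""
    counts = Counter(
        frozenset(group)          # frozenset ignores the order within a group
        for lst in lists
        for group in combinations(lst, r)
    )
    return counts.most_common(top)  # top=None returns every group, ranked

myList = [['AB', 'AC', 'ACC'], ['AB', 'ACC'], ['ACC'], ['AC', 'ACC'], ['ACC', 'BB', 'AC']]
pairs = top_groups(myList, 2, top=2)
triplets = top_groups(myList, 3)
print(pairs)
print(triplets)
```

The work done is proportional to the number of groups that actually occur in the lists, rather than to all possible groups over the n strings.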

This is a general solution for combinations of any length:

import itertools
def most_freq(myList, n):
    d = {}  # maps each combination (as a sorted tuple) to its frequency
    for i in myList:
        if len(i) >= n:
            # sort first so the same group of strings always yields the same tuple,
            # regardless of the order in which it appears in the list
            for k in itertools.combinations(sorted(i), n):
                d[k] = d.get(k, 0) + 1  # increase the frequency for this combination by 1
    # sort the dictionary by value, in descending order
    return dict(sorted(d.items(), key=lambda item: item[1], reverse=True))

Examples:

myList=[['AB', 'AC', 'ACC'], ['AB','ACC'],['ACC'],['AC','ACC'],['ACC','BB','AC']]

>>> most_freq(myList,2)
{('AC', 'ACC'): 3, ('AB', 'ACC'): 2, ('AB', 'AC'): 1, ('AC', 'BB'): 1, ('ACC', 'BB'): 1}
>>> most_freq(myList,3)
{('AB', 'AC', 'ACC'): 1, ('AC', 'ACC', 'BB'): 1}

Found a snippet on my hard drive, check if it helps you:

from collections import Counter
from itertools import combinations

mylist = [['AB', 'AC', 'ACC'], ['AB','ACC'],['ACC'],['AC','ACC'],['ACC','BB','AC']]
d  = Counter()
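for s in mylist:
    if len(s) < 2:  # a list with fewer than two strings contains no pairs
        continue
    # sort a copy so the same pair always yields the same tuple
    for c in combinations(sorted(s), 2):
        d[c] += 1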

print(list(d.most_common()[0][0]))

This prints the list ['AC', 'ACC'].

I have a rather simple approach that does not use any libraries.
First, for each list inside the main list, compute a hash for every pair of strings (more on string hashing here: https://cp-algorithms.com/string/string-hashing.html ). Maintain a dictionary that holds the count for each hash that occurs. In the end, sort the dictionary by count to get all pairs ranked by how often they occur.

Example: [['AB', 'AC', 'ACC', 'TR'], ['AB','ACC']]
For list 1, that is ['AB', 'AC', 'ACC', 'TR'], compute the hash for each of the pairs "AB AC", "AB ACC", "AB TR", "AC ACC", "AC TR", "ACC TR" and add them to the dictionary, keeping the two strings of a pair in a fixed order so that the same pair always produces the same hash. Repeat the same for all lists inside the main list.
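A minimal library-free sketch of this approach is shown below. Note one substitution: instead of a hand-written polynomial string hash, it uses the pair itself as a dictionary key and lets Python's built-in hashing of tuples do the work. The function name `rank_pairs` is made up for illustration.

```python
def rank_pairs(lists):
    """Count how often each unordered pair of strings appears together in a list."""
    counts = {}
    for lst in lists:
        for i in range(len(lst)):
            for j in range(i + 1, len(lst)):
                a, b = lst[i], lst[j]
                # Put the pair in a fixed order so ('AC', 'ACC') and
                # ('ACC', 'AC') are counted under the same key.
                key = (a, b) if a <= b else (b, a)
                counts[key] = counts.get(key, 0) + 1
    # Rank the pairs by occurrence count, most frequent first
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)

example = [['AB', 'AC', 'ACC', 'TR'], ['AB', 'ACC']]
ranked = rank_pairs(example)
print(ranked[0])
```

For the example above, the first entry is the pair ('AB', 'ACC') with a count of 2, since it is the only pair that appears in both lists.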
