Given a list of lists of strings, find the most frequent pair of strings, the second most frequent pair, ..., then the most frequent triplet of strings, etc.

I have a list that contains k lists of strings (none of these k lists contains a duplicate string). We know the union of all possible strings (suppose there are n unique strings).

What we need to find is: what is the most frequent pair of strings (i.e., which 2 strings appear together the most across the k lists)? And the second most frequent pair of strings, the third most frequent pair, etc.? I'd also like to know the most frequent triplet of strings, the second most frequent triplet, and so on.

The only algorithm I could think of has terrible complexity. To find the most frequent pair, I'd enumerate all possible pairs out of the n strings (O(n^2)), check for each of them how many lists contain both strings (O(k)), and then sort the results to get what I need. The overall complexity is therefore O(n^2 * k), ignoring the final sort.
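For reference, the brute-force baseline described above might look like the following sketch. The function name `brute_force_pair_counts` and the `universe` argument (the set of n unique strings) are illustrative assumptions, not part of the original question.

```python
from itertools import combinations

def brute_force_pair_counts(lists, universe):
    """Hypothetical helper: for every pair drawn from `universe`,
    count how many of the lists contain both strings."""
    as_sets = [set(lst) for lst in lists]  # average O(1) membership tests
    counts = {}
    for a, b in combinations(sorted(universe), 2):  # O(n^2) candidate pairs
        hits = sum(1 for s in as_sets if a in s and b in s)  # O(k) lists checked
        if hits:
            counts[(a, b)] = hits
    # Rank pairs from most to least frequent.
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)

myList = [['AB', 'AC', 'ACC'], ['AB', 'ACC'], ['ACC'], ['AC', 'ACC'], ['ACC', 'BB', 'AC']]
universe = {s for lst in myList for s in lst}
ranking = brute_force_pair_counts(myList, universe)
print(ranking[:2])
```

Most of the O(n^2) candidate pairs never occur together in any list, which is why counting only the pairs that actually appear in each list tends to be much cheaper.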

Any ideas for a faster algorithm, ideally one that also works well for triplets of strings, quadruplets of strings, etc.? Code in Python is best, but detailed pseudocode (and data structures, if relevant) or a detailed general idea is fine, too!

For example, if

myList = [['AB', 'AC', 'ACC'], ['AB', 'ACC'], ['ACC'], ['AC', 'ACC'], ['ACC', 'BB', 'AC']]

then the expected output for the pairs question would be: ('AC', 'ACC') is the most frequent pair and ('AB', 'ACC') is the second most frequent pair.

You can use combinations, Counter and frozenset (a frozenset makes each pair hashable and independent of the order of its two strings):

from itertools import combinations
from collections import Counter

combos = (combinations(i, r=2) for i in myList)
Counter(frozenset(i) for c in combos for i in c).most_common(2)

Output:

[(frozenset({'AC', 'ACC'}), 3), (frozenset({'AB', 'ACC'}), 2)]
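The same idea should extend to triplets, quadruplets, and so on by changing `r`. Below is a minimal sketch under that assumption; the function name `ranked_groups` and its `top` parameter are hypothetical and not part of the answer above.

```python
from collections import Counter
from itertools import combinations

def ranked_groups(lists, r, top=None):
    """Hypothetical helper: rank all groups of r strings by how many lists contain them."""
    counts = Counter(
        frozenset(combo)                   # order-independent key
        for lst in lists
        for combo in combinations(lst, r)  # yields nothing when len(lst) < r
    )
    return counts.most_common(top)         # top=None returns every group, ranked

myList = [['AB', 'AC', 'ACC'], ['AB', 'ACC'], ['ACC'], ['AC', 'ACC'], ['ACC', 'BB', 'AC']]
print(ranked_groups(myList, 2, top=2))
print(ranked_groups(myList, 3))
```

A list of length m contributes C(m, r) combinations, so the running time depends on the list lengths rather than on the n^2 pairs of the whole universe.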

This is a general solution for combinations of any length:

import itertools

def most_freq(myList, n):
    d = {}  # maps each combination (as a tuple) to its frequency
    for i in myList:
        if len(i) >= n:
            # sort first so the same strings always form the same tuple,
            # regardless of their order within the list
            for k in itertools.combinations(sorted(i), n):
                d[k] = d.get(k, 0) + 1  # increase the frequency of this combination by 1
    # sort the dictionary by frequency, in descending order
    return dict(sorted(d.items(), key=lambda item: item[1], reverse=True))

Examples:

myList = [['AB', 'AC', 'ACC'], ['AB', 'ACC'], ['ACC'], ['AC', 'ACC'], ['ACC', 'BB', 'AC']]

>>> most_freq(myList, 2)
{('AC', 'ACC'): 3, ('AB', 'ACC'): 2, ('AB', 'AC'): 1, ('AC', 'BB'): 1, ('ACC', 'BB'): 1}
>>> most_freq(myList, 3)
{('AB', 'AC', 'ACC'): 1, ('AC', 'ACC', 'BB'): 1}

Found a snippet on my hard drive, check if it helps you:

from collections import Counter
from itertools import combinations

mylist = [['AB', 'AC', 'ACC'], ['AB','ACC'],['ACC'],['AC','ACC'],['ACC','BB','AC']]
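d = Counter()
for s in mylist:
    if len(s) < 2:  # a list with fewer than 2 strings contains no pair
        continue
    s.sort()  # sort in place so each pair is always generated in the same order
    for c in combinations(s, 2):
        d[c] += 1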

print(list(d.most_common()[0][0]))

This prints the list ['AC', 'ACC'].

I have a rather simple approach that does not use any libraries.
First, for each list inside the main list, compute a hash for every pair of strings (more on string hashing here: https://cp-algorithms.com/string/string-hashing.html ). Maintain a dictionary that holds the number of occurrences of each hash. At the end, we just need to sort the dictionary to get all pairs, ranked by their occurrence count.

Example: [['AB', 'AC', 'ACC', 'TR'], ['AB', 'ACC']]
For list 1, that is ['AB', 'AC', 'ACC', 'TR'],
compute the hash for each pair, such as "AB AC", "AC ACC", "ACC TR" (and likewise for the remaining pairs of that list), and add each one to the dictionary. Repeat the same for all lists inside the main list.
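The steps above could be sketched as follows. This is only an illustration under stated assumptions: the constants `BASE` and `MOD`, the separator character, and the helper names `poly_hash`, `pair_key` and `rank_pairs_by_hash` are all hypothetical choices, and two different pairs could in principle collide on the same hash.

```python
MOD = 10**9 + 9  # assumed large prime modulus
BASE = 257       # assumed base for the polynomial hash

def poly_hash(text):
    """Polynomial hash: sum of (ord(c) + 1) * BASE**i, modulo MOD."""
    value, power = 0, 1
    for ch in text:
        value = (value + (ord(ch) + 1) * power) % MOD
        power = (power * BASE) % MOD
    return value

def pair_key(a, b):
    """Order-independent hash of a pair: join the two strings in sorted order."""
    first, second = (a, b) if a <= b else (b, a)
    # "\x00" is assumed never to occur inside the input strings.
    return poly_hash(first + "\x00" + second)

def rank_pairs_by_hash(lists):
    counts = {}  # hash -> number of lists in which the pair occurs
    labels = {}  # hash -> one pair with that hash, kept only to report results
    for lst in lists:
        for i in range(len(lst)):
            for j in range(i + 1, len(lst)):  # every pair, not just neighbours
                key = pair_key(lst[i], lst[j])
                counts[key] = counts.get(key, 0) + 1
                labels.setdefault(key, tuple(sorted((lst[i], lst[j]))))
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [(labels[key], count) for key, count in ordered]

myList = [['AB', 'AC', 'ACC'], ['AB', 'ACC'], ['ACC'], ['AC', 'ACC'], ['ACC', 'BB', 'AC']]
print(rank_pairs_by_hash(myList)[:2])
```

In Python, tuples and frozensets are already hashable, so hashing the strings by hand mainly matters in languages without such built-in keys.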
