
How to calculate the frequency of this Python list?

I have a list of Python lists like this:

base_list (about 3,000,000 sublists):

[
   ['Hello','World','Lucy','Lily'],
   ['Hello','Smith','Simpson','Bart'],
   ....
]

Now I have a small list:

small_list:

['Hello','World']

Now I need to find out how many times small_list appears in base_list, that is, how many sublists of base_list it appears in.

"Appears" means that every element of the small list occurs somewhere in the sublist; for example, [1,3] appears in [1,2,3,4,5].
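To make that definition concrete, here is a minimal sketch. It assumes that order, adjacency and duplicates do not matter, which is how the set-based attempts in this thread treat the problem. The helper name `appears_in` is made up for illustration.

```python
def appears_in(small, row):
    """Return True if every element of `small` occurs somewhere in `row`."""
    # Assumption: order, adjacency and duplicates are ignored.
    return set(small).issubset(row)

print(appears_in([1, 3], [1, 2, 3, 4, 5]))  # True
print(appears_in([1, 6], [1, 2, 3, 4, 5]))  # False
```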

UPDATE

I've tried this:

1. Change base_list into a list of sets.

2. Then change small_list into a set too:

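def get_original_freq(self, actors):
    # Method of my class; self.orignal_rows holds the rows of base_list
    # (already converted to sets in step 1).
    count = 0
    s = set(actors)
    for row in self.orignal_rows:
        # Count the row if it contains every element of the small list.
        if s.issubset(row):
            count += 1
    return count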

But the code runs really slowly: only about 1,000 records are checked per second.

My first reaction is to give a silly (albeit working) answer:

def sublistCount(listA, listB):
    if not len(listB):
        return 0
    conditions = ["%s in a" % repr(b) for b in listB]
    comprehension = '[a for a in listA if %s]' % ' and '.join(conditions)
    return len(eval(comprehension))

where listA is the list of lists and listB is the sublist.

This is actually pretty fast, even when working with lists of strings. I ran through a list of 3,000,000 lists of strings in about 1-2 seconds.

I called it silly because it uses the eval() function to create code on the fly. If you're not sure what your input will be, this could be dangerous. This solution is the bassoon of the orchestra of possible solutions: it's funny, it works, but just one bad note or squeak makes it all bad.

However, my favorite of the potential solutions is this:

def sublistCount(listA, listB):
    b = set(listB)
    matches = [a for a in listA if b.issubset(a)]
    return len(matches)

This is safer, much cleaner, and performs nearly as well as the first solution (for 3,000,000 records).
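If only the count is needed, a possible variant (a sketch, not part of the answer above) sums over a generator instead of building the intermediate list of matches. The name `sublist_count` and the choice to return 0 for an empty query are assumptions; the empty-query choice mirrors the eval() version, whereas the set version as written would count every row.

```python
def sublist_count(list_a, list_b):
    """Count the rows of list_a that contain every element of list_b."""
    wanted = set(list_b)
    if not wanted:
        # Assumption: an empty query matches nothing, as in the eval() version.
        return 0
    # The generator avoids materializing the list of matching rows.
    return sum(1 for row in list_a if wanted.issubset(row))

base_list = [
    ['Hello', 'World', 'Lucy', 'Lily'],
    ['Hello', 'Smith', 'Simpson', 'Bart'],
]
print(sublist_count(base_list, ['Hello', 'World']))  # 1
```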

I found that an inverted index helps me out:

1. Turn base_list into an inverted index that maps each word to the numbers of the sublists containing it:

{
    'Hello': [1,5,10,8000],
    'World': [1,2,3,5,9],
    ...
}

2. When I need to count the appearances of ['Hello','World'], I just look up the index entries for the two words and count the documents they have in common. For the entries shown above, those are documents 1 and 5, so the count would be 2.
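The two steps above could be sketched roughly as follows. This is a minimal illustration, not the original poster's code: the function names are made up, document numbers are taken to be 0-based positions in base_list, and each index entry is stored as a set so that the common documents can be found by set intersection.

```python
from collections import defaultdict

def build_inverted_index(base_list):
    """Map each word to the set of row numbers (positions in base_list) containing it."""
    index = defaultdict(set)
    for row_id, row in enumerate(base_list):
        for word in row:
            index[word].add(row_id)
    return index

def count_appearances(index, small_list):
    """Count the rows that contain every word of small_list."""
    if not small_list:
        return 0  # assumption: an empty query matches nothing
    # A word that is missing from the index occurs in no row.
    postings = [index.get(word, set()) for word in small_list]
    # Intersect starting from the shortest entry to keep intermediate sets small.
    postings.sort(key=len)
    common = postings[0].intersection(*postings[1:])
    return len(common)

base_list = [
    ['Hello', 'World', 'Lucy', 'Lily'],
    ['Hello', 'Smith', 'Simpson', 'Bart'],
]
index = build_inverted_index(base_list)
print(count_appearances(index, ['Hello', 'World']))  # 1
```

The index is built once in a single pass over base_list; after that, each query only touches the entries for the words in small_list rather than all 3,000,000 rows.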
