
Iterate elements in list to check if in dictionary python & numpy

I want to use numpy to speed up a computation in which I have a dictionary and want to create a vector from it, based on the presence of words as keys in the dictionary. I currently do this (a dummy example is provided for better understanding; the actual data is much larger):

self.bigram_freq = {"a cat": 3, "man child": 2, "pokemon team": 4}
sentences = ['a boy ran over a cat with his bike yesterday afternoon']
for sentence in sentences:
    feature_vector = []
    # generate pairs from this sentence and see if in bigram_freq
    bigram_pairs = self.retrieve_pairs(sentence)
    dummy_dict = dict.fromkeys(self.bigram_freq, 0)
    for pair in bigram_pairs:
        if pair in self.bigram_freq:
            dummy_dict[pair] += 1
    feature_vector = list(dict(sorted(dummy_dict.items(), key=lambda item: item[0])).values())
    outputVector.append(feature_vector)

But because of the two loops, it is quite slow. I was wondering if this could be sped up using numpy and np.where. I was thinking of creating an array with np.zeros and then populating a specific index of the ndarray when the corresponding token (a pair from bigram_pairs) is present, but I am unable to do so. Any help would be appreciated.

I have attempted to fix your script like this in order to make it work:

outputVector = []
bigram_freq = {"a cat": 3, "man child": 2, "pokemon team": 4}
sentences = ['a boy ran over a cat with his bike yesterday afternoon',
             'he was dreaming about pokemon team at the moment he hit a cat']
# build the pairs of adjacent words, shown here for the second sentence
sentence = sentences[1]
S = sentence.split(' ')
bigram_pairs = [f'{x} {y}' for x, y in zip(S[:-1], S[1:])]
>>> bigram_pairs
['he was', 'was dreaming', 'dreaming about', 'about pokemon', 'pokemon team', 'team at', 'at the', 'the moment', 'moment he', 'he hit', 'hit a', 'a cat']

Now, in plain Python you do it like this:

for sentence in sentences:
    # rebuild the pairs for each sentence
    S = sentence.split(' ')
    bigram_pairs = [f'{x} {y}' for x, y in zip(S[:-1], S[1:])]
    dummy_dict = dict.fromkeys(bigram_freq, 0)
    for pair in bigram_pairs:
        if pair in bigram_freq:
            dummy_dict[pair] += 1
    # counts ordered by sorted key
    feature_vector = list(dict(sorted(dummy_dict.items(), key=lambda item: item[0])).values())
    outputVector.append(feature_vector)
>>> outputVector
[[1, 0, 0], [1, 0, 1]]
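
The question mentions filling an np.zeros array at the index of each matching pair. A minimal sketch of that idea follows, assuming the columns are ordered by sorted key and the pairs are computed separately for each sentence. The names `keys`, `column_of` and `counts` are made up for illustration. It still loops in Python, so any speed-up would come from not rebuilding and sorting a dictionary for every sentence rather than from numpy itself:

```python
import numpy as np

bigram_freq = {"a cat": 3, "man child": 2, "pokemon team": 4}
sentences = [
    "a boy ran over a cat with his bike yesterday afternoon",
    "he was dreaming about pokemon team at the moment he hit a cat",
]

# Fix the column order once instead of sorting a dict per sentence.
keys = sorted(bigram_freq)
column_of = {key: i for i, key in enumerate(keys)}  # hypothetical helper name

# One row per sentence, one column per dictionary key.
counts = np.zeros((len(sentences), len(keys)), dtype=np.int64)
for row, sentence in enumerate(sentences):
    words = sentence.split(" ")
    for x, y in zip(words[:-1], words[1:]):
        col = column_of.get(f"{x} {y}")
        if col is not None:
            counts[row, col] += 1

print(counts.tolist())  # [[1, 0, 0], [1, 0, 1]]
```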

And you want to make it faster. Now take a look at this question. You don't actually need to create a list of all pairs, because numpy can check directly whether a specific pair occurs in a sentence:

import numpy as np

outputVector = []
match_with = list(bigram_freq)
for sentence in sentences:
    # np.char.find returns -1 for every pair that is not a substring of the sentence
    feature_vector = np.char.find(sentence, match_with) != -1
    outputVector.append(feature_vector)
>>> outputVector
[array([ True, False, False]), array([ True, False,  True])]
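
The remaining loop over sentences can probably be dropped as well by broadcasting a column of sentences against the list of keys. The following is a sketch under that assumption, not a benchmarked solution; `sentence_col`, `presence` and `counts` are names chosen here. Two caveats apply: the match is a plain substring search, so "a cat" would also be found inside "banana cat", and `np.char.find` only reports presence, while `np.char.count` gives non-overlapping occurrence counts:

```python
import numpy as np

bigram_freq = {"a cat": 3, "man child": 2, "pokemon team": 4}
sentences = [
    "a boy ran over a cat with his bike yesterday afternoon",
    "he was dreaming about pokemon team at the moment he hit a cat",
]

match_with = list(bigram_freq)

# Shape (n_sentences, 1) broadcasts against (n_keys,) to (n_sentences, n_keys).
sentence_col = np.array(sentences)[:, None]

# Boolean matrix: True where the key occurs as a substring of the sentence.
presence = np.char.find(sentence_col, match_with) != -1

# Integer matrix: number of non-overlapping substring occurrences.
counts = np.char.count(sentence_col, match_with)

print(presence.tolist())  # [[True, False, False], [True, False, True]]
print(counts.tolist())    # [[1, 0, 0], [1, 0, 1]]
```

Note that the column order here follows the insertion order of bigram_freq, which happens to coincide with sorted order in this example; use sorted(bigram_freq) if the sorted order matters.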
