Iterating over the elements of a list to check whether they are in a dictionary, in Python and NumPy

Question

I want to use numpy to speed up a computation in which I have a dictionary and want to create a vector from it, based on whether word pairs occur as keys in the dictionary. This is what I currently do (a dummy example is provided for illustration; the actual data is much larger):

self.bigram_freq = {"a cat": 3, "man child": 2, "pokemon team": 4}
sentences = ['a boy ran over a cat with his bike yesterday afternoon']
outputVector = []
for sentence in sentences:
    feature_vector = []
    # generate pairs from this sentence and see if in bigram_freq
    bigram_pairs = self.retrieve_pairs(sentence)
    dummy_dict = dict.fromkeys(self.bigram_freq, 0)
    for pair in bigram_pairs:
        if pair in self.bigram_freq:
            dummy_dict[pair] += 1
    # counts ordered by sorted bigram key
    feature_vector = list(dict(sorted(dummy_dict.items(), key=lambda item: item[0])).values())
    outputVector.append(feature_vector)

But because of the two nested loops, it is quite slow. I was wondering whether this could be sped up using numpy and np.where. I was thinking of creating an array with np.zeros and then populating a specific index of the ndarray when the corresponding token (a pair from bigram_pairs) is present, but I have not been able to get that working. Any help would be appreciated.

Answer 1

I have tried to fix your script like this so that it runs:

outputVector = []
bigram_freq =  {"a cat":3, "man child":2, "pokemon team":4} 
sentences = ['a boy ran over a cat with his bike yesterday afternoon', 
             'he was dreaming about pokemon team at the moment he hit a cat']
sentence = sentences[1]  # the second sentence, used for the output shown below
S = sentence.split(' ')
bigram_pairs = [f'{x} {y}' for x, y in zip(S[:-1], S[1:])]
>>> bigram_pairs
['he was', 'was dreaming', 'dreaming about', 'about pokemon', 'pokemon team', 'team at', 'at the', 'the moment', 'moment he', 'he hit', 'hit a', 'a cat']

Now, in plain Python you do it like this:

for sentence in sentences:
    # rebuild the pairs for each sentence
    S = sentence.split(' ')
    bigram_pairs = [f'{x} {y}' for x, y in zip(S[:-1], S[1:])]
    dummy_dict = dict.fromkeys(bigram_freq, 0)
    for pair in bigram_pairs:
        if pair in bigram_freq:
            dummy_dict[pair] += 1
    feature_vector = list(dict(sorted(dummy_dict.items(), key=lambda item: item[0])).values())
    outputVector.append(feature_vector)
>>> outputVector
[[1, 0, 0], [1, 0, 1]]

And you want to make it faster. Now take a look at this question. You don't actually need to create a list of all pairs, because numpy lets you check whether a specific pair occurs in a sentence. Note that this is a substring test, so it yields True/False per pair rather than a count:

import numpy as np

outputVector = []
match_with = list(bigram_freq)
for sentence in sentences:
    # np.char.find returns -1 where the pair is not a substring of the sentence
    feature_vector = np.char.find(sentence, match_with) != -1
    outputVector.append(feature_vector)
>>> outputVector
[array([ True, False, False]), array([ True, False,  True])]
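
If you need counts rather than booleans, the np.zeros idea from the question can be sketched roughly as follows. This is only a minimal sketch, not a benchmarked solution: the helper names bigrams and count_matrix are made up for illustration, and it assumes that pairs are formed from whitespace-separated words. It maps each dictionary key to a fixed column once, then fills one row per sentence, so the dictionary is no longer rebuilt and sorted for every sentence.

```python
from collections import Counter

import numpy as np


def bigrams(sentence):
    """Return adjacent word pairs of a sentence as 'w1 w2' strings."""
    words = sentence.split()
    return [f"{x} {y}" for x, y in zip(words[:-1], words[1:])]


def count_matrix(sentences, vocabulary):
    """Build a (n_sentences, n_bigrams) matrix of bigram counts.

    Columns follow the sorted order of the keys in `vocabulary`,
    the same order the loop-based version produces.
    """
    # Assign each known bigram a fixed column index, once.
    columns = {pair: j for j, pair in enumerate(sorted(vocabulary))}
    counts = np.zeros((len(sentences), len(columns)), dtype=np.int64)
    for i, sentence in enumerate(sentences):
        # Count every pair of the sentence in one pass, then keep known ones.
        for pair, n in Counter(bigrams(sentence)).items():
            j = columns.get(pair)
            if j is not None:
                counts[i, j] = n
    return counts


bigram_freq = {"a cat": 3, "man child": 2, "pokemon team": 4}
sentences = [
    "a boy ran over a cat with his bike yesterday afternoon",
    "he was dreaming about pokemon team at the moment he hit a cat",
]
output = count_matrix(sentences, bigram_freq)
print(output)
```

Unlike the substring test, this only matches whole word pairs, so a sentence containing "a category" does not count as containing "a cat". The per-sentence work is still a Python loop; whether it is fast enough depends on your data.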

Answer 1
1 2021-12-03 00:50:25
