Efficiently and quickly iterating over more than 36 million items in a list of tuples in Python

Question

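First, before anyone marks this as a duplicate, please read the following. I am not sure whether the delay in my iteration is caused by the sheer size of the data or by my logic. I have a use case where I have to iterate over more than 36 million items in a list of tuples. My main requirements are speed and efficiency. Sample list: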

[
    ('how are you', 'I am fine'),
    ('how are you', 'I am not fine'),
    ...36 million items...
]

What I have done so far:

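from ast import literal_eval
from operator import itemgetter

import numpy as np
from nltk.tokenize import word_tokenize
from scipy.spatial import distance

for query_question in combined: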
    query = "{}".format(word_tokenize(query_question[0]))
    question = "{}".format(word_tokenize(query_question[1]))

    # the function uses a naive doc2vec extension of GLOVE word vectors
    vec1 = np.mean([
        word_vector_dict[word]
        for word in literal_eval(query)
        if word in word_vector_dict
    ], axis=0)

    vec2 = np.mean([
        word_vector_dict[word]
        for word in literal_eval(question)
        if word in word_vector_dict
    ], axis=0)

    similarity_score = 1 - distance.cosine(vec1, vec2)
    # list.append returns None, so the list must not be reassigned
    store_question_score.append(
        (query_question[1], similarity_score)
    )
    count += 1

    if count == len(data_list):
        # list.sort sorts in place and returns None, so slice the list itself
        store_question_score.sort(key=itemgetter(1), reverse=True)
        result_dict[query_question[0]] = store_question_score[:5]
        store_question_score = []
        count = 1

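The logic above computes similarity scores between questions as part of a text-similarity algorithm. I suspect the delay in the iteration comes from computing vec1 and vec2. If so, how can I do this better? I am looking for ways to speed up this process.

There are many other questions about iterating over huge lists, but I could not find one that solves my problem.

I would really appreciate any help you can offer.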

Answer 1

Try caching:

from functools import lru_cache

@lru_cache(maxsize=None)
def compute_vector(s):
    return np.mean([
        word_vector_dict[word]
        for word in literal_eval(s)
        if word in word_vector_dict
    ], axis=0)

Then use this instead:

vec1 = compute_vector(query)
vec2 = compute_vector(question)
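
To see why this helps, here is a small self-contained sketch of the same idea. The toy `word_vector_dict`, the sample `pairs`, and the use of `str.split` in place of `word_tokenize` are assumptions made only for illustration. Keys are passed as tuples of tokens because `lru_cache` needs hashable arguments.

```python
from functools import lru_cache

import numpy as np

# Hypothetical toy embeddings standing in for the GloVe word_vector_dict.
word_vector_dict = {
    "how": np.array([1.0, 0.0]),
    "are": np.array([0.0, 1.0]),
    "you": np.array([1.0, 1.0]),
}


@lru_cache(maxsize=None)
def compute_vector(tokens):
    # tokens is a tuple of words; tuples are hashable, so they work as cache keys.
    return np.mean(
        [word_vector_dict[w] for w in tokens if w in word_vector_dict], axis=0
    )


# The same query string shows up in every pair, as in the sample list.
pairs = [
    ("how are you", "you are"),
    ("how are you", "how you"),
    ("how are you", "you are"),
]

for query, question in pairs:
    vec1 = compute_vector(tuple(query.split()))
    vec2 = compute_vector(tuple(question.split()))

# cache_info() reports how many calls were answered from the cache.
info = compute_vector.cache_info()
print(info.hits, info.misses)
```

Only distinct sentences trigger the `np.mean` computation, so the benefit grows with how often a query or question repeats. Note that the cache hands back the same array object on every hit, so the result should be treated as read-only.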

If the vectors have a fixed size, you can do even better by caching into a numpy array of shape (num_unique_keys, len(vec1)), where in your case num_unique_keys = 370000 + 100:

class VectorCache:
    def __init__(self, func, num_keys, item_size):
        self.func = func
        self.cache = np.empty((num_keys, item_size), dtype=float)
        self.keys = {}

    def __getitem__(self, key):
        if key in self.keys:
            return self.cache[self.keys[key]]
        self.keys[key] = len(self.keys)
        item = self.func(key)
        self.cache[self.keys[key]] = item
        return item


def compute_vector(s):
    return np.mean([
        word_vector_dict[word]
        for word in literal_eval(s)
        if word in word_vector_dict
    ], axis=0)


vector_cache = VectorCache(compute_vector, num_keys, item_size)

Then:

vec1 = vector_cache[query]
vec2 = vector_cache[question]
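
As a rough check of how the array-backed cache behaves, the sketch below repeats the class and drives it with a stand-in function. `toy_vector`, the `calls` list, and the sample sentences are hypothetical and exist only to make the example runnable without the real embeddings.

```python
import numpy as np


class VectorCache:
    def __init__(self, func, num_keys, item_size):
        self.func = func
        # One preallocated row per distinct key.
        self.cache = np.empty((num_keys, item_size), dtype=float)
        self.keys = {}

    def __getitem__(self, key):
        if key in self.keys:
            return self.cache[self.keys[key]]
        self.keys[key] = len(self.keys)
        item = self.func(key)
        self.cache[self.keys[key]] = item
        return item


calls = []  # records every time the underlying function really runs


def toy_vector(sentence):
    # Hypothetical stand-in for compute_vector: (character count, word count).
    calls.append(sentence)
    return np.array([len(sentence), sentence.count(" ") + 1], dtype=float)


vector_cache = VectorCache(toy_vector, num_keys=3, item_size=2)

first = vector_cache["how are you"]   # computed and stored in row 0
second = vector_cache["how are you"]  # read back from row 0
other = vector_cache["I am fine"]     # computed and stored in row 1

print(len(calls))
print(second)
```

`num_keys` has to be at least the number of distinct keys you will ever look up; once the rows run out, the assignment into `self.cache` raises an IndexError.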

Using a similar technique, you can also cache the cosine distance:

@lru_cache(maxsize=None)
def cosine_distance(query, question):
    return distance.cosine(vector_cache[query], vector_cache[question])
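
One thing to keep in mind with the pair cache is that the key is the ordered pair of arguments. The sketch below illustrates this under a few assumptions: the `vectors` dict is a hypothetical stand-in for `vector_cache`, and the cosine distance is written out with plain numpy instead of `distance.cosine` so the example has no scipy dependency.

```python
from functools import lru_cache

import numpy as np

# Hypothetical precomputed sentence vectors standing in for vector_cache.
vectors = {
    "how are you": np.array([1.0, 0.0]),
    "I am fine": np.array([1.0, 1.0]),
}


@lru_cache(maxsize=None)
def cosine_distance(query, question):
    a, b = vectors[query], vectors[question]
    # Cosine distance is 1 minus the cosine of the angle between a and b.
    return 1.0 - float(a @ b) / float(np.linalg.norm(a) * np.linalg.norm(b))


d1 = cosine_distance("how are you", "I am fine")  # computed
d2 = cosine_distance("how are you", "I am fine")  # served from the cache
d3 = cosine_distance("I am fine", "how are you")  # swapped order, computed again

info = cosine_distance.cache_info()
print(info.hits, info.misses)
```

So this cache only pays off when the exact same (query, question) pair shows up more than once. If every pair in the list is unique, the per-sentence vector cache is where the savings come from.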

Answer 2

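First, I would suggest using line_profiler to find out where most of the time is being spent.

Add the decorator to the function you want to profile and run your script (possibly on a smaller amount of data so that it does not take too long).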

from line_profiler import LineProfiler

def do_profile(follow=[]):
    def inner(func):
        def profiled_func(*args, **kwargs):
            try:
                profiler = LineProfiler()
                profiler.add_function(func)
                for f in follow:
                    profiler.add_function(f)
                profiler.enable_by_count()
                return func(*args, **kwargs)
            finally:
                profiler.print_stats()
        return profiled_func
    return inner

@do_profile()
def encapsulating_function():
    for query_question in combined:
        query = "{}".format(word_tokenize(query_question[0]))
        question = "{}".format(word_tokenize(query_question[1]))

        # the function uses a naive doc2vec extension of GLOVE word vectors
        vec1 = np.mean([
            word_vector_dict[word]
            for word in literal_eval(query)
            if word in word_vector_dict
        ], axis=0)
        vec2 = np.mean([
            word_vector_dict[word]
            for word in literal_eval(question)
            if word in word_vector_dict
        ], axis=0)
        similarity_score = 1 - distance.cosine(vec1, vec2)
        # list.append returns None, so the list must not be reassigned
        store_question_score.append((query_question[1], similarity_score))
        count += 1
        if count == len(data_list):
            # list.sort sorts in place and returns None
            store_question_score.sort(key=itemgetter(1), reverse=True)
            result_dict[query_question[0]] = store_question_score[:5]
            store_question_score = []
            count = 1

(The do_profile function is copied from: https://zapier.com/engineering/profiling-python-boss/)
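
If installing line_profiler is not an option, a coarser first look is possible with the standard library's cProfile, which reports time per function rather than per line. This is a swapped-in alternative, not the approach above, and `slow_step` and the body of `encapsulating_function` here are hypothetical placeholders for your real loop.

```python
import cProfile
import io
import pstats


def slow_step(n):
    # Hypothetical stand-in for the per-pair work (tokenizing, averaging, cosine).
    return sum(i * i for i in range(n))


def encapsulating_function():
    # Run on a small sample first so the profile finishes quickly.
    return [slow_step(1000) for _ in range(200)]


profiler = cProfile.Profile()
profiler.enable()
result = encapsulating_function()
profiler.disable()

# Collect the report in a string, sorted by cumulative time, top 5 entries.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

The functions at the top of the cumulative column are the ones worth inspecting line by line.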

I can't help you any further until you have done this.

2 solutions

Solution 1
1 2021-05-04 03:02:47

Solution 2
0 2021-05-03 17:11:26
