Iterate over 36 million items in a list of tuples in Python efficiently and faster
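First of all, before anyone marks this as a duplicate, please read the following. I am not sure whether the delay in the iteration comes from the sheer size of the data or from my logic. I have a use case where I have to iterate over more than 36 million items in a list of tuples. My main requirements are speed and efficiency. Sample list: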
    [
        ('how are you', 'I am fine'),
        ('how are you', 'I am not fine'),
        ...36 million items...
    ]
What I have done so far:
    for query_question in combined:
        query = "{}".format(word_tokenize(query_question[0]))
        question = "{}".format(word_tokenize(query_question[1]))
        # the function uses a naive doc2vec extension of GloVe word vectors
        vec1 = np.mean([
            word_vector_dict[word]
            for word in literal_eval(query)
            if word in word_vector_dict
        ], axis=0)
        vec2 = np.mean([
            word_vector_dict[word]
            for word in literal_eval(question)
            if word in word_vector_dict
        ], axis=0)
        similarity_score = 1 - distance.cosine(vec1, vec2)
        # list.append returns None, so its result must not be assigned back
        store_question_score.append(
            (query_question[1], similarity_score)
        )
        count += 1
        if count == len(data_list):
            # sorted() returns a new list; list.sort() returns None
            store_question_score_descending = sorted(
                store_question_score, key=itemgetter(1), reverse=True
            )
            result_dict[query_question[0]] = store_question_score_descending[:5]
            store_question_score = []
            count = 1
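The logic above is meant to compute similarity scores between questions as part of a text-similarity algorithm. I suspect the delay in the iteration comes from the computation of vec1 and vec2. If so, how can I do this better? I am looking for ways to speed up the process.
There are many other questions about iterating over huge lists, but I could not find one that solves my problem.
I would really appreciate any help you can offer.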
Try caching:
    from functools import lru_cache

    @lru_cache(maxsize=None)
    def compute_vector(s):
        # s is the stringified token list, so it is hashable and can be a cache key
        return np.mean([
            word_vector_dict[word]
            for word in literal_eval(s)
            if word in word_vector_dict
        ], axis=0)
Then use this instead:
    vec1 = compute_vector(query)
    vec2 = compute_vector(question)
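To see what the cache buys you, here is a small dependency-free sketch. It is not the code from the question: `str.split()` stands in for `word_tokenize`, the values in the toy `word_vector_dict` are made up, and plain tuples replace numpy arrays so that the snippet runs with the standard library alone. It also keys the cache on the raw sentence, which works because strings are hashable.

```python
from functools import lru_cache

# Toy stand-in for the GloVe lookup table (hypothetical values).
word_vector_dict = {
    "how": (1.0, 0.0),
    "are": (0.0, 1.0),
    "you": (1.0, 1.0),
    "fine": (0.5, 0.5),
}


@lru_cache(maxsize=None)
def compute_vector(sentence):
    """Average the vectors of the known words in `sentence`."""
    # str.split() is an assumption standing in for word_tokenize.
    vectors = [word_vector_dict[w] for w in sentence.split() if w in word_vector_dict]
    if not vectors:
        return None  # no known words, so there is nothing to average
    dims = len(vectors[0])
    return tuple(sum(v[i] for v in vectors) / len(vectors) for i in range(dims))


pairs = [
    ("how are you", "I am fine"),
    ("how are you", "I am not fine"),
    ("how are you", "fine"),
]
for query, question in pairs:
    compute_vector(query)
    compute_vector(question)

info = compute_vector.cache_info()
print(info)
```

Here `cache_info()` reports 2 hits and 4 misses: "how are you" is averaged once and then served from the cache for the other two pairs. How much this helps on the real data depends on how often the same sentence repeats across the 36 million tuples.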
If the vectors have a fixed size, you can do even better by caching into a numpy array of shape (num_unique_keys, len(vec1)), where in your case num_unique_keys = 370000 + 100:
    class VectorCache:
        def __init__(self, func, num_keys, item_size):
            self.func = func
            # one preallocated row per unique key
            self.cache = np.empty((num_keys, item_size), dtype=float)
            # maps each key to its row index in self.cache
            self.keys = {}

        def __getitem__(self, key):
            if key in self.keys:
                return self.cache[self.keys[key]]
            # first time this key is seen: assign the next free row and fill it
            self.keys[key] = len(self.keys)
            item = self.func(key)
            self.cache[self.keys[key]] = item
            return item

    def compute_vector(s):
        return np.mean([
            word_vector_dict[word]
            for word in literal_eval(s)
            if word in word_vector_dict
        ], axis=0)

    vector_cache = VectorCache(compute_vector, num_keys, item_size)
Then:
    vec1 = vector_cache[query]
    vec2 = vector_cache[question]
Using a similar technique, you can also cache the cosine distance:
    @lru_cache(maxsize=None)
    def cosine_distance(query, question):
        return distance.cosine(vector_cache[query], vector_cache[question])
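One thing to keep in mind with a cached distance: `lru_cache` treats `(a, b)` and `(b, a)` as different keys, although cosine distance is symmetric. The sketch below is one possible way to share a single entry between both orders. It is only an illustration: the `vectors` dict and its values are hypothetical, `_cosine` is a plain-Python replacement for `distance.cosine`, and `_cached_distance` is a helper name introduced here.

```python
import math
from functools import lru_cache

# Hypothetical precomputed sentence vectors (toy values).
vectors = {
    "how are you": (1.0, 2.0, 2.0),
    "I am fine": (2.0, 1.0, 2.0),
}


def _cosine(u, v):
    """Cosine distance, i.e. 1 minus the cosine similarity of u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


@lru_cache(maxsize=None)
def _cached_distance(first, second):
    return _cosine(vectors[first], vectors[second])


def cosine_distance(query, question):
    # Put the two keys in a fixed order so both call orders hit the same entry.
    if question < query:
        query, question = question, query
    return _cached_distance(query, question)


d1 = cosine_distance("how are you", "I am fine")
d2 = cosine_distance("I am fine", "how are you")
info = _cached_distance.cache_info()
print(d1, d2, info)
```

Both calls return the same value and the second one is a cache hit. This only pays off if the same pair of sentences actually occurs more than once in the data; if every pair is unique, the cache just costs memory.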
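First, I would suggest using line_profiler to find out where most of the time is being spent.
Add the decorator to the function you want to profile and run your script (possibly on a smaller amount of data, so it doesn't take too long).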
    from line_profiler import LineProfiler

    def do_profile(follow=[]):
        def inner(func):
            def profiled_func(*args, **kwargs):
                try:
                    profiler = LineProfiler()
                    profiler.add_function(func)
                    for f in follow:
                        profiler.add_function(f)
                    profiler.enable_by_count()
                    return func(*args, **kwargs)
                finally:
                    profiler.print_stats()
            return profiled_func
        return inner
    @do_profile()
    def encapsulating_function():
        # both names are rebound below; this assumes they exist at module level
        global store_question_score, count
        for query_question in combined:
            query = "{}".format(word_tokenize(query_question[0]))
            question = "{}".format(word_tokenize(query_question[1]))
            # the function uses a naive doc2vec extension of GloVe word vectors
            vec1 = np.mean(
                [word_vector_dict[word] for word in literal_eval(query) if word in word_vector_dict],
                axis=0)
            vec2 = np.mean(
                [word_vector_dict[word] for word in literal_eval(question) if word in word_vector_dict],
                axis=0)
            similarity_score = 1 - distance.cosine(vec1, vec2)
            # list.append returns None, so its result must not be assigned back
            store_question_score.append((query_question[1], similarity_score))
            count += 1
            if count == len(data_list):
                # sorted() returns a new list; list.sort() returns None
                store_question_score_descending = sorted(
                    store_question_score, key=itemgetter(1), reverse=True)
                result_dict[query_question[0]] = store_question_score_descending[:5]
                store_question_score = []
                count = 1
(The do_profile function is copied from: https://zapier.com/engineering/profiling-python-boss/ )
I can't help you any further until you have done this.
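If installing line_profiler is not an option, the standard library's `cProfile` can give a first impression. Note that this is a different tool from the one recommended above: it reports time per function, not per line. The sketch below is a rough illustration only. `tokenize_roundtrip` and `score_pairs` are hypothetical names, and `str.split()` stands in for `word_tokenize`; the round trip through `format` and `literal_eval` mimics what the loop in the question does for every tuple.

```python
import cProfile
import io
import pstats
from ast import literal_eval


def tokenize_roundtrip(sentence):
    # Mimics the list -> str -> list round trip from the question's loop.
    # str.split() is an assumption standing in for word_tokenize.
    return literal_eval("{}".format(sentence.split()))


def score_pairs(pairs):
    # Placeholder workload: counts tokens so the example stays self-contained.
    total = 0
    for query, question in pairs:
        total += len(tokenize_roundtrip(query)) + len(tokenize_roundtrip(question))
    return total


pairs = [("how are you", "I am fine")] * 2000

profiler = cProfile.Profile()
profiler.enable()
result = score_pairs(pairs)
profiler.disable()

# Collect the ten most expensive entries by cumulative time into a string.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
print(report)
```

The cumulative column of the report shows how much of the run is spent inside `tokenize_roundtrip` and the calls below it, which is the kind of information you need before deciding what to cache or rewrite.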