Python - Find all the similar sentences between two documents using sklearn
I am trying to return all the similar sentences between two documents. The solution I have works, but it is extremely slow. Is there a more efficient way to accomplish this?
I open the two documents (A and B) and use NLTK to extract each sentence from both documents into list_a and list_b (fast). From there, I loop over each sentence in list_a and compare it to every sentence in list_b (slow). If two sentences are similar enough, based on a percentage score, I append them to a results_list to review later.
The code I use to compare two sentences:
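from sklearn.feature_extraction.text import TfidfVectorizer

# Compare two sentences and return their similarity as a percentage
def compare_sentences(sentences_a, sentences_b):
    # Init our vectorizer
    vect = TfidfVectorizer(min_df=1)
    # Create our tfidf matrix (one row per sentence)
    tfidf = vect.fit_transform([sentences_a, sentences_b])
    # Get a dense array of pairwise similarities
    # (.toarray() is used because the .A shorthand is gone in recent SciPy)
    results = (tfidf * tfidf.T).toarray()
    # Return percentage float, rounded to 4 decimal places
    return round(float(results[0][1]) * 100, 4)
# end compare_sentences()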
I've seen many helpful answers describing how to compare two documents in a general sense, but I would like a solution that provides a list of all similar sentences between the two.
I appreciate your help.
Have you profiled your code? That is always the first step when optimizing.
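As a rough illustration of what profiling could look like here, the sketch below wraps a call in the standard-library cProfile module and returns the ten most expensive entries by cumulative time. The helper name profile_call and the stand-in function slow_pairwise are hypothetical and not from the question; in practice you would pass in your own nested comparison loop.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, report_text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Show the 10 entries with the highest cumulative time.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


def slow_pairwise(list_a, list_b):
    # Hypothetical stand-in for the nested loop over list_a and list_b.
    return [(a, b) for a in list_a for b in list_b if a == b]


if __name__ == "__main__":
    _, report = profile_call(slow_pairwise, ["x", "y"], ["y", "z"])
    print(report)
```

The report should make it visible whether the time goes into constructing and fitting the vectorizer or somewhere else entirely.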
That being said, you're currently initializing the TfidfVectorizer for each pair of sentences: if you have m sentences in one file and n in another, that's m*n initializations. But constructing that object doesn't depend on the sentences, so you only need to do it once and then pass it to the function. That might be some low-hanging fruit to grab.
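Taking that idea one step further, here is a minimal sketch that goes beyond the suggestion above. It is one possible approach, not necessarily the fix this answer had in mind. It builds the vectorizer once, fits it on all sentences from both documents, and gets every pairwise score from a single sparse matrix product. The function name find_similar_sentences and the threshold parameter are assumptions for illustration. Note that fitting on all sentences at once changes the IDF weights, so the scores will not match the per-pair numbers from compare_sentences exactly.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def find_similar_sentences(list_a, list_b, threshold=50.0):
    """Return (sentence_a, sentence_b, percent) for pairs scoring >= threshold.

    threshold is a hypothetical cutoff in percent; pick one that suits your data.
    """
    vect = TfidfVectorizer(min_df=1)
    # Fit once on every sentence so both documents share one vocabulary.
    vect.fit(list(list_a) + list(list_b))
    tfidf_a = vect.transform(list_a)
    tfidf_b = vect.transform(list_b)
    # Rows are L2-normalized by default, so this product is cosine similarity.
    scores = (tfidf_a @ tfidf_b.T).tocoo()
    results = []
    # Only nonzero scores are stored in the sparse result.
    for i, j, score in zip(scores.row, scores.col, scores.data):
        percent = round(float(score) * 100, 4)
        if percent >= threshold:
            results.append((list_a[i], list_b[j], percent))
    return results


if __name__ == "__main__":
    doc_a = ["the cat sat on the mat"]
    doc_b = ["the cat sat on the mat", "completely different words here"]
    for pair in find_similar_sentences(doc_a, doc_b, threshold=90.0):
        print(pair)
```

Because pairs that share no terms never appear in the sparse product, this sketch only makes sense with a threshold above zero.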