Python: Find all similar sentences between two documents using sklearn

Question

I am trying to return all the similar sentences between two documents. The solution I have works, but it is extremely slow. Is there a more efficient way to accomplish this?

I open the two documents (A and B) and use NLTK to extract each sentence from both documents into list_a and list_b (fast). From there, I loop over each sentence in list_a and compare it to every sentence in list_b (slow). If two sentences are similar enough, based on a percentage, I append them to a results_list to review later.

The code I use to compare two sentences:

from sklearn.feature_extraction.text import TfidfVectorizer

# Compare two sentences
def compare_sentences(sentences_a, sentences_b):

    # Init our vectorizer
    vect = TfidfVectorizer(min_df=1)

    # Build the tf-idf matrix: one L2-normalized row per sentence
    tfidf = vect.fit_transform([sentences_a, sentences_b])

    # Pairwise cosine similarities as a dense 2x2 array
    results = (tfidf @ tfidf.T).toarray()

    # Return the similarity as a percentage, rounded to 4 decimal places
    return round(float(results[0][1]) * 100, 4)

# end compare_sentences()

I've seen many helpful answers describing how to compare two documents in a general sense, but I would like to find a solution that provides a list of all similar sentences between the two.

I appreciate your help.

Answer 1

Have you profiled your code? That is always the first step when optimizing.
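
For reference, here is a minimal profiling sketch using the standard library's cProfile and pstats modules. The helper name profile_call is hypothetical (it is not from the question); it runs any callable under the profiler and returns the ten most expensive entries by cumulative time.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, report_text).

    profile_call is a hypothetical helper written for this illustration.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)

    # Collect the report in memory instead of printing it directly.
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()
```

You would call it as profile_call(your_loop_function, list_a, list_b), where your_loop_function stands for whatever function wraps the nested loop, and then print the returned report to see where the time goes.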

That being said, you're currently initializing the TfidfVectorizer for each pair of sentences. If you have m sentences in one file and n in the other, that's m*n initializations. But that object itself doesn't depend on the sentences, so you only need to create it once and then pass it to the function. That might be some low-hanging fruit to grab.
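
A rough sketch of that idea, taken one step further: create and fit a single TfidfVectorizer on all sentences, then get every pairwise score from one sparse matrix product instead of calling compare_sentences m*n times. The function name find_similar_sentences and the threshold parameter are assumptions made for this illustration. Note one behavioral difference: because the vectorizer is fitted on all sentences at once, the IDF weights differ from the per-pair version, so the percentages will not match the original function exactly.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def find_similar_sentences(list_a, list_b, threshold=50.0):
    """Return (sentence_a, sentence_b, percent) for pairs scoring >= threshold.

    Hypothetical helper; the name and the default threshold are assumptions.
    """
    list_a, list_b = list(list_a), list(list_b)

    # Create and fit the vectorizer once, on every sentence from both documents.
    vect = TfidfVectorizer(min_df=1)
    vect.fit(list_a + list_b)

    # Rows are L2-normalized by default, so the dot product of two rows
    # is their cosine similarity.
    tfidf_a = vect.transform(list_a)
    tfidf_b = vect.transform(list_b)

    # One sparse product gives the full len(list_a) x len(list_b) score matrix.
    scores = (tfidf_a @ tfidf_b.T).tocoo()

    results = []
    # Only non-zero scores are stored, so pairs with no shared terms are skipped.
    for i, j, score in zip(scores.row, scores.col, scores.data):
        percent = round(float(score) * 100, 4)
        if percent >= threshold:
            results.append((list_a[i], list_b[j], percent))
    return results
```

Calling find_similar_sentences(list_a, list_b, threshold=80.0) would return the matching pairs with their scores, which could then be appended to results_list. The threshold value is up to you.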
