
Python - Find all the similar sentences between two documents using sklearn

I am trying to return all the similar sentences between two documents. The solution I have works, but it is extremely slow. Is there a more efficient way to accomplish this?

I open the two documents (A and B) and use NLTK to extract each sentence from both documents into list_a and list_b (fast). From there, I loop over each sentence in list_a and compare it to every sentence in list_b (slow). If two sentences are similar based on a percentage, I append the pair to a results_list to review later.

The code I use to compare two sentences:

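from sklearn.feature_extraction.text import TfidfVectorizer

# Compare two sentences
def compare_sentences(sentences_a, sentences_b):

    # Init our vectorizer (a new one is built on every call)
    vect = TfidfVectorizer(min_df=1)

    # Create our tfidf matrix: one row per sentence
    tfidf = vect.fit_transform([sentences_a, sentences_b])

    # Rows are L2-normalized by default, so this product holds cosine similarities.
    # toarray() replaces the .A shorthand, which newer SciPy releases no longer provide.
    results = (tfidf * tfidf.T).toarray()

    # Return percentage float, rounded to 4 decimal places
    return float('%.4f' % (results[0][1] * 100))

# end compare_sentences()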

I've seen many helpful answers describing how to compare two documents in a general sense, but I would like to find a solution that provides a list of all similar sentences between the two.

I appreciate your help.

Have you profiled your code? That is always the first step when optimizing.
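As a rough illustration of that first step, here is a minimal profiling sketch using the standard library's cProfile and pstats. The names slow_pairwise and profile_call are hypothetical, and slow_pairwise is only a stand-in for the nested loop from the question, not the asker's actual code.

```python
import cProfile
import io
import pstats


def slow_pairwise(list_a, list_b):
    # Hypothetical stand-in for the nested loop: every sentence in list_a
    # is compared with every sentence in list_b (here by shared words).
    hits = []
    for a in list_a:
        for b in list_b:
            shared = set(a.split()) & set(b.split())
            if shared:
                hits.append((a, b))
    return hits


def profile_call(func, *args):
    """Run func(*args) under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time and keep the ten most expensive entries
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()


result, report = profile_call(slow_pairwise, ["a b c", "d e"], ["c d", "x y"])
print(report)
```

Swapping slow_pairwise for the real comparison loop should show whether the time goes into building the vectorizer, fitting it, or the matrix product.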

That being said, you're currently initializing the TfidfVectorizer on each pair of sentences: if you have m sentences in one file and n in another, that's m*n initializations. But that object doesn't depend on the sentences, so you only need to create it once and then pass it to the function. That might be some low-hanging fruit to grab.
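One way to take that idea a step further is sketched below, under some assumptions: the vectorizer is created and fitted once on all sentences from both lists, and every pair is scored in a single sparse matrix product instead of m*n separate calls. The function name find_similar_sentences and the threshold parameter are hypothetical. Because the IDF weights are now computed over all sentences rather than over each pair, the percentages will not match the pairwise function exactly.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def find_similar_sentences(list_a, list_b, threshold=50.0):
    """Return (sentence_a, sentence_b, percent) for pairs scoring at or above threshold."""
    # Fit once on every sentence so the vocabulary and IDF weights are shared
    vect = TfidfVectorizer(min_df=1)
    vect.fit(list_a + list_b)
    tfidf_a = vect.transform(list_a)
    tfidf_b = vect.transform(list_b)

    # Rows are L2-normalized by default, so this product is the
    # cosine similarity matrix (len(list_a) x len(list_b)), scaled to percent
    scores = (tfidf_a @ tfidf_b.T).toarray() * 100

    results = []
    for i, row in enumerate(scores):
        for j, score in enumerate(row):
            if score >= threshold:
                results.append((list_a[i], list_b[j], round(float(score), 4)))
    return results


list_a = ["The cat sat on the mat.", "Stocks fell sharply today."]
list_b = ["A cat sat on a mat.", "It rained all afternoon."]
results = find_similar_sentences(list_a, list_b, threshold=30.0)
for pair in results:
    print(pair)
```

For very large documents the dense toarray() call could become a memory concern; iterating over the nonzero entries of the sparse product would be one alternative.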

