
Python - Find all the similar sentences between two documents using sklearn

I am trying to return all the similar sentences between two documents, and the solution I have works, but it is extremely slow. Is there a more efficient way to accomplish this?

I open the two documents (A and B), and using NLTK I extract each sentence from both documents into list_a and list_b (fast). From there, I loop over each sentence in list_a and compare it to every sentence in list_b (slow). If the similarity between two sentences exceeds a threshold percentage, I append the pair to a results_list to review later.

The code I use to compare two sentences:

from sklearn.feature_extraction.text import TfidfVectorizer

# Compare two sentences
def compare_sentences( sentences_a, sentences_b ):

    # Init our vectorizer
    vect = TfidfVectorizer( min_df = 1 )

    # Create our tfidf matrix (one row per sentence)
    tfidf = vect.fit_transform( [ sentences_a, sentences_b ] )

    # Rows are L2-normalised, so the dot product is the cosine similarity
    results = ( tfidf * tfidf.T ).A

    # Return percentage float
    return float( '%.4f' % ( results[0][1] * 100 ) )

# end compare_sentences()
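For context, the outer loop that drives this function looks roughly like the following. The names list_a, list_b, results_list, and the 75% threshold are assumptions based on the description above, not my exact code:

# Sketch of the driver loop described above (names and threshold are assumptions)
THRESHOLD = 75.0  # minimum similarity percentage to keep a pair

results_list = []

for sentence_a in list_a:
    for sentence_b in list_b:
        score = compare_sentences( sentence_a, sentence_b )
        if score >= THRESHOLD:
            results_list.append( ( sentence_a, sentence_b, score ) )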

I've seen many helpful answers describing how to compare two documents in a general sense, but I would like to find a solution that provides a list of all similar sentences between the two.

I appreciate your help.

Have you profiled your code? That is always the first step when optimizing.

That being said, you're currently initializing the TfidfVectorizer on each pair of sentences: if you have m sentences in one file and n in the other, that's m*n initializations (and m*n fit_transform calls). But that object doesn't depend on the sentences, so you only need to create it once and then pass it to the function. That might be some low-hanging fruit to grab.
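As a rough illustration of that idea, here is a minimal sketch that fits a single TfidfVectorizer on the sentences from both documents, transforms each list once, and scores every pair with one call to sklearn's cosine_similarity. The names list_a, list_b, and the 75% threshold are assumptions carried over from the question:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def find_similar_sentences( list_a, list_b, threshold = 75.0 ):

    # Fit the vocabulary once on all sentences from both documents
    vect = TfidfVectorizer( min_df = 1 )
    vect.fit( list_a + list_b )

    # Transform each document's sentences into TF-IDF matrices
    tfidf_a = vect.transform( list_a )
    tfidf_b = vect.transform( list_b )

    # One call gives every pairwise cosine similarity, as percentages
    scores = cosine_similarity( tfidf_a, tfidf_b ) * 100

    # Collect the pairs above the threshold
    results_list = []
    for i, sentence_a in enumerate( list_a ):
        for j, sentence_b in enumerate( list_b ):
            if scores[i][j] >= threshold:
                results_list.append( ( sentence_a, sentence_b, scores[i][j] ) )
    return results_list

One caveat: fitting the vocabulary on both lists together gives slightly different IDF weights than fitting on each pair in isolation, so the scores won't match the original function exactly, but the pairs it flags as similar should be very close.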
