
Python - Find all the similar sentences between two documents using sklearn

I am trying to return all the similar sentences between two documents, and the solution I have works, but it is extremely slow. Is there a more efficient way to accomplish this?

I open the two documents (A and B), and using NLTK I extract each sentence from both documents into list_a and list_b (fast). From there, I loop over each sentence in list_a and compare it to every sentence in list_b (slow). If the similarity between two sentences exceeds a threshold percentage, I append the pair to a results_list to review later.

The code I use to compare two sentences:

from sklearn.feature_extraction.text import TfidfVectorizer

# Compare two sentences
def compare_sentences( sentences_a, sentences_b ):

    # Init our vectorizer
    vect = TfidfVectorizer( min_df = 1 )

    # Create our tfidf matrix (one row per sentence)
    tfidf = vect.fit_transform( [ sentences_a, sentences_b ] )

    # Rows are L2-normalised, so the dot product is the cosine similarity
    results = ( tfidf * tfidf.T ).A

    # Return percentage float
    return float( '%.4f' % ( results[0][1] * 100 ) )

# end compare_sentences()
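For context, the outer loop that drives this function looks roughly like the following. The names list_a, list_b, results_list, and the 75% threshold are assumptions based on the description above, not my exact code:

# Sketch of the driver loop described above (names and threshold are assumptions)
THRESHOLD = 75.0  # minimum similarity percentage to keep a pair

results_list = []

for sentence_a in list_a:
    for sentence_b in list_b:
        score = compare_sentences( sentence_a, sentence_b )
        if score >= THRESHOLD:
            results_list.append( ( sentence_a, sentence_b, score ) )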

I've seen many helpful answers describing how to compare two documents in a general sense, but I would like to find a solution that provides a list of all similar sentences between the two.

I appreciate your help.

Have you profiled your code? That is always the first step when optimizing.

That being said, you're currently initializing the TfidfVectorizer on each pair of sentences: if you have m sentences in one file and n in the other, that's m*n initializations (and m*n fit_transform calls). But that object doesn't depend on the sentences, so you only need to create it once and then pass it to the function. That might be some low-hanging fruit to grab.
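As a rough illustration of that idea, here is a minimal sketch that fits a single TfidfVectorizer on the sentences from both documents, transforms each list once, and scores every pair with one call to sklearn's cosine_similarity. The names list_a, list_b, and the 75% threshold are assumptions carried over from the question:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def find_similar_sentences( list_a, list_b, threshold = 75.0 ):

    # Fit the vocabulary once on all sentences from both documents
    vect = TfidfVectorizer( min_df = 1 )
    vect.fit( list_a + list_b )

    # Transform each document's sentences into TF-IDF matrices
    tfidf_a = vect.transform( list_a )
    tfidf_b = vect.transform( list_b )

    # One call gives every pairwise cosine similarity, as percentages
    scores = cosine_similarity( tfidf_a, tfidf_b ) * 100

    # Collect the pairs above the threshold
    results_list = []
    for i, sentence_a in enumerate( list_a ):
        for j, sentence_b in enumerate( list_b ):
            if scores[i][j] >= threshold:
                results_list.append( ( sentence_a, sentence_b, scores[i][j] ) )
    return results_list

One caveat: fitting the vocabulary on both lists together gives slightly different IDF weights than fitting on each pair in isolation, so the scores won't match the original function exactly, but the pairs it flags as similar should be very close.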
