
Find all potentially similar documents in a list of documents using clustering

I'm working with the Quora question pairs CSV file, which I loaded into a pandas DataFrame, keeping only the qid and question columns, so my questions are in this form:

0        What is the step by step guide to invest in sh...
1        What is the step by step guide to invest in sh...
2        What is the story of Kohinoor (Koh-i-Noor) Dia...
3        What would happen if the Indian government sto...
.....
19408    What are the steps to solve this equation: [ma...
19409                           Is IMS noida good for BCA?
19410              How good is IMS Noida for studying BCA?

My actual dataset is much bigger (500k questions), but I will use these questions to showcase my problem.

I want to identify pairs of questions that have a high probability of asking the same thing. I considered the naive approach: turn each sentence into a vector using doc2vec, then for each sentence compute the cosine similarity with every other sentence, keep the most similar one, and finally print every pair with a high enough similarity. The problem is that this is a quadratic number of comparisons and would take ages to finish, so I need another approach.
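For reference, the all-pairs baseline looks roughly like the sketch below (assuming the doc2vec vectors have already been computed, e.g. with gensim; most_similar_pairs is a hypothetical helper). The full n-by-n similarity matrix is exactly what makes this infeasible for 500k documents:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def most_similar_pairs(vectors, threshold=0.9):
    # vectors: (n_docs, dim) array of doc2vec embeddings -- assumed precomputed
    sims = cosine_similarity(vectors)   # O(n^2) in both time and memory
    np.fill_diagonal(sims, 0)           # every document is trivially similar to itself
    best = sims.argmax(axis=1)          # index of each document's closest neighbour
    return [(i, j, sims[i, j]) for i, j in enumerate(best) if sims[i, j] >= threshold]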

Then I found an answer to another question that suggests using clustering to solve a similar problem. The following is the code I implemented based on that answer.

"Load and transform the dataframe to a new one with only question ids and questions"
train_df = pd.read_csv("test.csv", encoding='utf-8')

questions_df=pd.wide_to_long(train_df,['qid','question'],i=['id'],j='drop')
questions_df=questions_df.drop_duplicates(['qid','question'])[['qid','question']]
questions_df.sort_values("qid", inplace=True)
questions_df=questions_df.reset_index(drop=True)

print(questions_df['question'])

# vectorization of the texts
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(questions_df['question'].values.astype('U'))
# used words (the axes of our multi-dimensional space)
words = vectorizer.get_feature_names_out()  # get_feature_names() was removed in recent scikit-learn
print("words", words)


n_clusters = 30
number_of_seeds_to_try = 10  # n_init: how many random centroid seeds to try
max_iter = 300
# note: recent scikit-learn versions removed KMeans' n_jobs parameter
model = KMeans(n_clusters=n_clusters, max_iter=max_iter,
               n_init=number_of_seeds_to_try).fit(X)

labels = model.labels_
# for each cluster, word indices sorted by descending centroid weight
ordered_words = model.cluster_centers_.argsort()[:, ::-1]

print("centers:", model.cluster_centers_)
print("labels", labels)
print("intertia:", model.inertia_)

# count how many documents landed in each cluster
texts_per_cluster = np.bincount(labels, minlength=n_clusters)

print("Top words per cluster:")
for i_cluster in range(n_clusters):
    print("Cluster:", i_cluster, "texts:", int(texts_per_cluster[i_cluster])),
    for term in ordered_words[i_cluster, :10]:
        print("\t"+words[term])

print("\n")
print("Prediction")

text_to_predict = "Why did Donald Trump win the elections?"
Y = vectorizer.transform([text_to_predict])
predicted_cluster = model.predict(Y)[0]
texts_per_cluster[predicted_cluster]+=1

print(text_to_predict)
print("Cluster:", predicted_cluster, "texts:", int(texts_per_cluster[predicted_cluster])),
for term in ordered_words[predicted_cluster, :10]:
    print("\t"+words[term])

I thought that this way, for each sentence, I could find the cluster it most likely belongs to and then calculate the cosine similarity only against the other questions in that cluster. Instead of running the computation over the whole dataset, I would be running it over far fewer documents. However, running the code on the example sentence "Why did Donald Trump win the elections?" gives the following results.

Prediction
Why did Donald Trump win the elections?
Cluster: 25 texts: 244
    trump
    donald
    clinton
    hillary
    president
    vote
    win
    election
    did
    think

I know that my sentence belongs to cluster 25 and I can see the top words for that cluster. However, how can I access the sentences that are in this cluster? Is there any way to do it?

You can use predict to get the cluster labels, and then use numpy to get the indices of all the documents in a specific cluster:

clusters = model.fit_predict(X)  # or just model.labels_, since the model is already fitted

cluster_indices = np.where(clusters == 0)[0]

So now cluster_indices holds the row indices of all the documents in that cluster, which you can use to index back into your DataFrame.
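For the original goal of finding similar pairs, you can then restrict the expensive cosine-similarity step to the documents of a single cluster. A minimal sketch, reusing questions_df, X, and the fitted model from the question above; the 0.8 threshold is an arbitrary choice:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

cluster_id = 25
idx = np.where(model.labels_ == cluster_id)[0]

# the actual sentences in this cluster
cluster_questions = questions_df['question'].iloc[idx]

# pairwise cosine similarity, restricted to this cluster only
sims = cosine_similarity(X[idx])
np.fill_diagonal(sims, 0)  # ignore each document's similarity with itself

# report each question together with its most similar neighbour in the cluster
best = sims.argmax(axis=1)
for i, j in enumerate(best):
    if sims[i, j] >= 0.8:
        print(cluster_questions.iloc[i], "<->", cluster_questions.iloc[j], round(sims[i, j], 3))

Each cluster holds only a few hundred documents, so the similarity matrix per cluster stays small, and repeating this over all 30 clusters is far cheaper than one 500k-by-500k comparison.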
