Get clusters of words using KMeans and TF-IDF
I am trying to cluster the words in a set of texts. Suppose I have a list of texts:
text=["WhatsApp extends 'confusing' update deadline",
"India begins world's biggest Covid vaccine drive",
"Nepali climbers make history with K2 winter summit"]
I applied TF-IDF to this data:
vec = TfidfVectorizer()
feat = vec.fit_transform(text)
After that, I applied KMeans:
kmeans = KMeans(n_clusters=num).fit(feat)
What I am confused about is how to get clusters of words, such as:
cluster 0
WhatsApp, update,biggest
cluster 1
history,biggest ,world's
etc.
You can use the get_feature_names_out() method of the TfidfVectorizer class (called get_feature_names() in scikit-learn versions before 1.0) together with the predictions from KMeans to inspect the words in each cluster.
Here's a minimal example with two clusters and the three sentences you provided:
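import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

text = ["WhatsApp extends 'confusing' update deadline",
        "India begins world's biggest Covid vaccine drive",
        "Nepali climbers make history with K2 winter summit"]

# Turn each sentence into a TF-IDF vector
vec = TfidfVectorizer()
feat = vec.fit_transform(text)

# Cluster the sentences (not the individual words) into two groups
kmeans = KMeans(n_clusters=2).fit(feat)
pred = kmeans.predict(feat)

for i in range(2):
    print(f"Cluster #{i}:")
    words = []
    for sentence in np.array(text)[pred == i]:
        # get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2.
        # The substring test is case-sensitive, so capitalised words such as "WhatsApp" are skipped.
        words += [fn for fn in vec.get_feature_names_out() if fn in sentence]
    print(words)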
Result:
Cluster #0:
['confusing', 'deadline', 'extends', 'update', 'begins', 'biggest', 'drive', 'vaccine', 'world']
Cluster #1:
['climbers', 'history', 'make', 'summit', 'winter', 'with']