
How do I visualize data points of tf-idf vectors for k-means clustering?

I have a list of documents and the tf-idf score for each unique word in the entire corpus. How do I visualize that on a 2-d plot to get a sense of how many clusters I will need for k-means?

Here is my code:

from sklearn.feature_extraction.text import TfidfVectorizer

sentence_list = ["Hi how are you", "Good morning", ...]  # rest of corpus elided
vectorizer = TfidfVectorizer(min_df=1, stop_words='english', decode_error='ignore')
vectorized = vectorizer.fit_transform(sentence_list)
num_samples, num_features = vectorized.shape
print("num_samples: %d, num_features: %d" % (num_samples, num_features))
num_clusters = 10

As you can see, I am able to transform my sentences into a tf-idf document matrix. But I am unsure how to plot the data points of the tf-idf scores.

I was thinking:

  1. Add more variables, like document length and something else
  2. Do PCA to get an output of 2 dimensions (a rough sketch follows below)
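
For option 2, a minimal sketch of what I have in mind, using the vectorized matrix from my code above (TruncatedSVD stands in for plain PCA here, since it accepts the sparse tf-idf matrix directly):

import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD

# reduce the sparse tf-idf matrix to 2 components for plotting
points_2d = TruncatedSVD(n_components=2).fit_transform(vectorized)
plt.scatter(points_2d[:, 0], points_2d[:, 1])
plt.show()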

Thanks

I am doing something similar at the moment, trying to plot tf-idf scores in 2D for a dataset of texts. My approach, similar to suggestions in other comments, is to use PCA and t-SNE from scikit-learn.

import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

num_clusters = 10
num_seeds = 10
max_iterations = 300
labels_color_map = {
    0: '#20b2aa', 1: '#ff7373', 2: '#ffe4e1', 3: '#005073', 4: '#4d0404',
    5: '#ccc0ba', 6: '#4700f9', 7: '#f6f900', 8: '#00f91d', 9: '#da8c49'
}
pca_num_components = 2
tsne_num_components = 2

# texts_list = some array of strings for which TF-IDF is being computed

# calculate tf-idf of texts
tf_idf_vectorizer = TfidfVectorizer(analyzer="word", use_idf=True, smooth_idf=True, ngram_range=(2, 3))
tf_idf_matrix = tf_idf_vectorizer.fit_transform(texts_list)

# create k-means model with custom config
clustering_model = KMeans(
    n_clusters=num_clusters,
    max_iter=max_iterations,
    n_init=num_seeds,  # number of random restarts; precompute_distances and n_jobs were removed from KMeans in scikit-learn 1.0
)

labels = clustering_model.fit_predict(tf_idf_matrix)
# print(labels)

X = tf_idf_matrix.toarray()  # densify for PCA and t-SNE

# ----------------------------------------------------------------------------------------------------------------------

reduced_data = PCA(n_components=pca_num_components).fit_transform(X)
# print(reduced_data)

# plot each document in PCA space, colored by its k-means cluster label
fig, ax = plt.subplots()
point_colors = [labels_color_map[label] for label in labels]
ax.scatter(reduced_data[:, 0], reduced_data[:, 1], c=point_colors)
plt.show()



# t-SNE plot
embeddings = TSNE(n_components=tsne_num_components)
Y = embeddings.fit_transform(X)
plt.scatter(Y[:, 0], Y[:, 1], c=labels, cmap=plt.cm.Spectral)  # pass c so the colormap actually applies
plt.show()
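
A follow-up note on this approach: t-SNE gets slow as the number of tf-idf features grows, and the scikit-learn documentation recommends compressing very high-dimensional data to around 50 dimensions first, e.g. with PCA. A minimal sketch of that variant, with an illustrative component count:

# optional speed-up: compress the tf-idf features with PCA before t-SNE
# (50 components is an illustrative choice)
X_compressed = PCA(n_components=50).fit_transform(X)
Y = TSNE(n_components=tsne_num_components).fit_transform(X_compressed)
plt.scatter(Y[:, 0], Y[:, 1], c=labels, cmap=plt.cm.Spectral)
plt.show()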

PCA is one approach. For TF-IDF I have also used scikit-learn's manifold package for non-linear dimensionality reduction. One thing that I find helpful is to label my points based on the TF-IDF scores.

Here's an example (you need to insert your TF-IDF implementation at the beginning):

import numpy
import scipy.stats
from matplotlib import pyplot
from sklearn import manifold

# Insert your TF-IDF vectorizing here; the code below assumes the
# vectorizer and vectorized objects from the question

##
# Do the dimension reduction
##
k = 10  # number of nearest neighbors to consider
d = 2   # dimensionality
pos = manifold.Isomap(n_neighbors=k, n_components=d, eigen_solver='auto').fit_transform(vectorized.toarray())

##
# Get meaningful "cluster" labels
##
# Semantic labeling of clusters: apply a label if the cluster's max TF-IDF
# is in the 99% quantile of the whole corpus of TF-IDF scores
labels = vectorizer.get_feature_names_out()  # text labels of features (get_feature_names() in older scikit-learn)
t99 = scipy.stats.mstats.mquantiles(vectorized.data, [0.99])[0]
clusterLabels = []
for i in range(0, vectorized.shape[0]):
    row = vectorized.getrow(i)
    if row.max() >= t99:
        arrayIndex = numpy.where(row.data == row.max())[0][0]
        clusterLabels.append(labels[row.indices[arrayIndex]])
    else:
        clusterLabels.append('')
##
# Plot the dimension reduced data
##
pyplot.xlabel('reduced dimension-1')
pyplot.ylabel('reduced dimension-2')
for i in range(len(pos)):  # annotate every point with its semantic label
    pyplot.scatter(pos[i][0], pos[i][1], c='cyan')
    pyplot.annotate(clusterLabels[i], pos[i], xytext=None, xycoords='data', textcoords='data', arrowprops=None)

pyplot.show()

I suppose you were looking for t-SNE, by van der Maaten and Hinton.

The publication: http://jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

This links to an IPython Notebook for doing this with sklearn.

In a nutshell, t-SNE is like PCA, but better at grouping objects that are related in a high-dimensional space within the 2-dimensional space of a plot.
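
A minimal sketch of using it, assuming the vectorized tf-idf matrix from the question (the perplexity value is illustrative):

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# project the densified tf-idf vectors down to 2 dimensions
coords = TSNE(n_components=2, perplexity=30.0).fit_transform(vectorized.toarray())
plt.scatter(coords[:, 0], coords[:, 1])
plt.show()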

Depending on your requirements, you can plot your scipy.sparse.csr.csr_matrix directly.

TfidfVectorizer.fit_transform() will give you a (document id, term id) tf-idf score. Now you can create a numpy matrix with terms as your x-axis and documents as your y-axis. A second option is to plot (term, tf-idf score), or you can plot in 3-D with (term, document, frequency); here you can also apply PCA.

Just create a numpy matrix from the scipy.sparse.csr.csr_matrix and use matplotlib, as sketched below.
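
A minimal sketch of the (term, tf-idf score) option, assuming the vectorizer and vectorized objects from the question; averaging each term's score across documents is an illustrative choice:

import numpy as np
import matplotlib.pyplot as plt

terms = vectorizer.get_feature_names_out()
# mean tf-idf score of each term across all documents
mean_scores = np.asarray(vectorized.mean(axis=0)).ravel()
plt.bar(range(len(terms)), mean_scores)
plt.xticks(range(len(terms)), terms, rotation=90)
plt.ylabel('mean tf-idf score')
plt.show()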
