
Weighting specific features in TF-IDF feature vectors for k-means clustering and cosine similarity

Original post: 2015-09-22 14:17:06 · Tags: python, machine-learning, scikit-learn, k-means, tf-idf

I have an array of TF-IDF feature vectors. I'd like to find similar vectors in the array using two methods:

Cosine similarity
k-means clustering

Using scikit-learn, this process is pretty simple.
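For context, the unweighted baseline might look roughly like the sketch below. The toy corpus, the choice of two clusters, and the variable names are all assumptions made for illustration; they are not from the original post.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus, used only for illustration.
docs = [
    "the cat sat on the mat",
    "a cat lay on the rug",
    "stock prices fell sharply today",
    "the market and stock prices rallied",
]

# Rows are documents, columns are vocabulary terms (a sparse matrix).
tfidf = TfidfVectorizer().fit_transform(docs)

# Pairwise cosine similarity between all documents (dense n x n array).
similarities = cosine_similarity(tfidf)

# k-means on the same vectors; n_clusters=2 is an arbitrary choice here.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(tfidf)
```

Both calls accept the sparse matrix directly, so no conversion is needed for this step.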

Now I'd like to weight certain features so that they influence the results more than the other features. For example, I might like to weight the first 100 elements of the TF-IDF vectors so that those features are more indicative of similarity than the rest of the features.

How can I meaningfully weight certain features in my feature vectors? Is the process for weighting certain features the same for each of the similarity algorithms I listed above?

1 answer

As I understand it, low values in the TF-IDF matrix mean that the corresponding words are less significant. So one approach is to lower the values in the columns you want to de-emphasize (or raise them in the columns you want to emphasize).
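A minimal sketch of this column-scaling idea follows. The helper name `weight_columns`, the toy matrix, and the weight of 3 are hypothetical. Re-normalizing the rows afterwards is an extra assumption of this sketch, not something the answer prescribes: it leaves cosine similarity unchanged and keeps the vectors at unit length for k-means, which works with Euclidean distance.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize


def weight_columns(tfidf, weights):
    """Multiply column j by weights[j], then L2-normalize each row.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    scaled = sparse.csr_matrix(tfidf) @ sparse.diags(weights)
    return normalize(scaled, norm="l2")


# Hypothetical 3-document, 4-feature matrix standing in for TF-IDF output.
X = sparse.csr_matrix(np.array([
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 1.0, 0.0],
]))

# Emphasize the first two features by a factor of 3 (arbitrary choice).
weights = np.array([3.0, 3.0, 1.0, 1.0])
Xw = weight_columns(X, weights)

before = cosine_similarity(X)   # before[0, 1] == before[0, 2] == 0.5
after = cosine_similarity(Xw)   # after[0, 1] == 0.9, after[0, 2] == 0.1
```

In this toy case, documents 0 and 1 share an up-weighted feature, so their similarity rises from 0.5 to 0.9, while documents 0 and 2 share only an unweighted feature and drop from 0.5 to 0.1. The same weighted matrix can be passed to `KMeans`, so the weighting step itself is identical for both methods; what differs is how each method turns the scaled values into a distance or similarity.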

The matrices that scikit-learn returns are sparse, so for testing and debugging you might want to convert them to a regular (dense) matrix. I also used xlsxwriter to get an overview of what is really happening when applying TF-IDF and KMeans++ (see https://www.dbc-enterprise-it-consulting.com/text-classifier/).