
Weighting specific features in TF-IDF feature vectors for k-means clustering and cosine similarity

I have an array of TF-IDF feature vectors. I'd like to find similar vectors in the array using two methods:

  1. Cosine similarity
  2. k-means clustering

Using scikit-learn, this process is pretty simple.
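For reference, the unweighted baseline might look like the following minimal sketch. The toy `docs` list and the choice of two clusters are assumptions made for illustration and are not part of the original question.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus (hypothetical): two documents about cats, two about markets.
docs = [
    "the cat sat on the mat",
    "a cat lay on the mat",
    "stock markets fell sharply today",
    "markets and stock prices fell",
]

# One sparse TF-IDF row per document; rows are L2-normalised by default.
X = TfidfVectorizer().fit_transform(docs)

# Method 1: pairwise cosine similarity, returned as a dense (n_docs, n_docs) array.
sim = cosine_similarity(X)

# Method 2: k-means on the same matrix (k=2 is an assumption for this toy data).
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print(sim.round(2))
print(labels)
```

Both methods consume the same matrix `X`, so any feature weighting applied to `X` would reach both of them.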

Now I'd like to weight certain features so that they influence the results more than the other features. For example, I might want to weight the first 100 elements of the TF-IDF vectors so that those features are more indicative of similarity than the rest.

How can I meaningfully weight certain features in my feature vectors? Is the process for weighting certain features the same for each of the similarity methods I listed above?

As I understand it, low values in the TF-IDF matrix mean that the words are less significant. So one approach is to lower the values in the matrix for the columns you want to count less; by the same reasoning, raising the values in a column should make it count more.
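A minimal sketch of this column-scaling idea follows. The toy corpus, the boosted term `"red"`, and the factor `5.0` are all hypothetical choices. Re-normalising the rows afterwards is my own assumption rather than something the answer prescribes; it keeps every vector at unit length, which tends to make Euclidean k-means behave more like a cosine-based comparison.

```python
import numpy as np
from scipy.sparse import diags
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize

# Toy corpus (hypothetical).
docs = [
    "red apple pie recipe",
    "green apple pie recipe",
    "red sports car review",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# One weight per column: 1.0 leaves a feature unchanged.
# Hypothetical choice: make the term "red" count five times as much.
weights = np.ones(X.shape[1])
weights[vectorizer.vocabulary_["red"]] = 5.0

# Right-multiplying by a diagonal matrix scales each column and keeps X sparse.
X_weighted = X @ diags(weights)

# Assumption: bring rows back to unit length after scaling.
X_weighted = normalize(X_weighted)

before = cosine_similarity(X)
after = cosine_similarity(X_weighted)

# The same weighted matrix can be passed to k-means.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_weighted)

print("docs 0 and 2 (share 'red'):", before[0, 2].round(3), "->", after[0, 2].round(3))
print("docs 0 and 1 (share the rest):", before[0, 1].round(3), "->", after[0, 1].round(3))
print(labels)
```

To weight the first 100 features, as in the question, `weights[:100] = factor` would play the same role as the single-term assignment here. Since both cosine similarity and k-means read the same matrix, the weighting step itself is the same for both.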

The matrices that scikit-learn returns here are sparse, so for testing and debugging you might want to convert them to a regular (dense) matrix. I also used xlsxwriter to get an overview of what is really happening when applying TF-IDF and KMeans++ (see https://www.dbc-enterprise-it-consulting.com/text-classifier/).
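As an illustration of the dense conversion, the sketch below prints each document's non-zero TF-IDF values next to their terms. The toy corpus is hypothetical, and this is only sensible for small matrices, since a dense copy of a large vocabulary can use a lot of memory.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical).
docs = ["red apple pie", "green apple pie", "red sports car"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)   # scipy sparse matrix

# Dense NumPy copy for inspection.
dense = X.toarray()

# Recover the term for each column from the fitted vocabulary (term -> column index).
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

for doc, row in zip(docs, dense):
    nonzero = {t: round(float(v), 3) for t, v in zip(terms, row) if v > 0}
    print(doc, "->", nonzero)
```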

