
Finding the number of clusters in a vectorized text document with sklearn tf-idf

I'm trying to cluster dialogs using sklearn tf-idf and k-means. I calculated the optimal number of clusters using the silhouette score, but it increases almost linearly. So, are there other ways, or maybe I'm doing something wrong?

Code:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

tfidfV = TfidfVectorizer(max_features=40000, ngram_range=(1, 3), sublinear_tf=True)
...
X = tfidfV.fit_transform(docm2)
...
# nn holds the candidate cluster counts to evaluate
for numb in nn:
    km = KMeans(n_clusters=numb)
    clabels = km.fit_predict(X)
    silhouette_avg = silhouette_score(X, clabels)
    print("For n_clusters =", numb, "the average silhouette score is:", silhouette_avg)

The underlying problem is much more severe, and there is no easy solution:

K-means is very sensitive to outliers, but typical text data is full of them: most documents are unusual in one way or another. Because of this, the "best" solution by the silhouette criterion is to put every non-duplicate point into its own cluster, i.e., to use an absurdly large k. Not only does this drastically increase the runtime, it also makes the result pretty much useless unless you are in a very idealized scenario like 20newsgroups.
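
If you want to check whether this effect is driving your rising silhouette score, one quick diagnostic (a sketch added here, not part of the original answer) is to look at the cluster size distribution: outlier-driven solutions produce more and more singleton clusters as k grows. This reuses clabels from the question's loop.

import numpy as np

# Count how many documents fall into each cluster.
sizes = np.bincount(clabels)
print("largest clusters:", sorted(sizes, reverse=True)[:10])
# Many singleton clusters suggest the score is rewarding outlier isolation.
print("singleton clusters:", int((sizes == 1).sum()))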

So use topic modeling or similar algorithms that work better in this scenario; a minimal example is sketched below. But I do not have any recommendation for alternative clustering algorithms: none seems to work well enough to be generally useful without endless parameter tuning.
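
As a concrete starting point, here is a minimal topic-modeling sketch using sklearn's LatentDirichletAllocation. docm2 is the document list from the question; the topic count of 10 is an arbitrary placeholder, not a tuned value.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# LDA models raw term counts, not tf-idf weights.
counts = CountVectorizer(max_features=40000, stop_words="english").fit_transform(docm2)

lda = LatentDirichletAllocation(n_components=10, random_state=0)
doc_topics = lda.fit_transform(counts)  # one topic-mixture row per document

# The dominant topic of each document can serve as a soft cluster label.
labels = doc_topics.argmax(axis=1)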
