
Finding the number of clusters in a vectorized text document with sklearn tf-idf

I'm trying to cluster dialogs using sklearn tf-idf and k-means. To find the optimal number of clusters I computed the silhouette score for a range of cluster counts, but it increases almost linearly with k. So, are there other ways, or am I doing something wrong?

Code:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

tfidfV = TfidfVectorizer(max_features=40000, ngram_range=(1, 3), sublinear_tf=True)
...
X = tfidfV.fit_transform(docm2)
...
# nn holds the candidate numbers of clusters, e.g. range(2, 50)
for numb in nn:
    km = KMeans(n_clusters=numb)
    clabels = km.fit_predict(X)
    silhouette_avg = silhouette_score(X, clabels)
    print("For n_clusters =", numb, "the average silhouette_score is:", silhouette_avg)

The underlying problem is much more severe, and there is no easy solution:

K-means is very sensitive to outliers, and typical text data contains plenty of them: most documents are unusual in one way or another. Because of this, the "best"-scoring solution is to put every non-duplicate point in its own cluster, i.e., to use an absurdly large k. Not only does this drastically increase the runtime, it also makes the result pretty much useless unless you are in a highly idealized scenario like 20newsgroups.
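One way to see this effect is to inspect the cluster-size distribution next to the silhouette score. The sketch below is only an illustration: it reuses the numb and clabels variables from the loop in the question and would go inside that loop. As k grows, the number of singleton clusters keeps rising while the silhouette keeps improving.

import numpy as np

# inside the loop from the question: how degenerate is the clustering at this k?
sizes = np.bincount(clabels, minlength=numb)
print("k =", numb,
      "| singleton clusters:", int((sizes == 1).sum()),
      "| largest cluster:", int(sizes.max()))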

So use topic modeling or similar algorithms that handle this scenario better. I do not have a recommendation for an alternative clustering algorithm, though: none seems to work well enough to be generally useful without endless parameter tuning.
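For illustration, here is a minimal topic-modeling sketch using scikit-learn's LatentDirichletAllocation. Note that LDA expects raw term counts rather than tf-idf weights, so it uses a CountVectorizer; docm2 stands for the same document list as in the question, and the choice of 20 topics is an arbitrary placeholder you would tune for your corpus.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# LDA works on raw term counts, not tf-idf weights
countV = CountVectorizer(max_features=40000, stop_words="english")
counts = countV.fit_transform(docm2)

# n_components is the number of topics; 20 is only a starting point
lda = LatentDirichletAllocation(n_components=20, learning_method="online", random_state=0)
doc_topics = lda.fit_transform(counts)  # shape: (n_documents, n_topics)

# show the ten highest-weighted terms of each topic
terms = countV.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[-10:][::-1]
    print("Topic", k, ":", ", ".join(terms[i] for i in top))

If a hard assignment is still needed, documents can then be grouped by their dominant topic via doc_topics.argmax(axis=1).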
