
K-Means Clustering - output clusters contain the same elements but in a different order [Python]

I followed this tutorial to run K-Means clustering on a list of individual words. This is a cricket-based project, so I picked K = 3 so that I can later label the three clusters as [batting, bowling, fielding]. However, after running the code, the three resulting clusters all contain the same elements, only in a different order. I tried making the initial list distinct, but that did not solve the problem. The code is below.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np
import pandas as pd

len(finaldatatext)
#2173
vectorizer = TfidfVectorizer(stop_words='english')
#finaldatatext here is the list containing distinct elements
X = vectorizer.fit_transform(finaldatatext)

true_k = 3
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)

order_centroids = model.cluster_centers_.argsort()[:, ::-1]
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
terms = vectorizer.get_feature_names_out()

clusterlists = []
for i in range(true_k):
    dummy_list = []
    for ind in order_centroids[i]:
        #print( '%s' % terms[ind])
        dummy_list.append('%s' % terms[ind])
    clusterlists.append(dummy_list)

The sample initial list is:

['anymore', 'silly', 'fielders', 'fans', 'rcb', 'precedent', 'reputation', 'pool', 'International', 'famous', 'Astle', 'max', 'stadium', 'bennet', 'working', 'lassi', 'ameetasinh', 'meantime', 'com', 'on', 'little', 'saini', 'Kanos', 'telling', 'six', 'PrithviShaw', 'started', 'letting', 'wYB2P72Il2', 'chess', 'brainwashed', 'Stat', 'mediocre', 'Afridi', 'hopes', 'strength', 'jamieson', 'managed', '46th', 'finale', 'PaRtNeRShIP', 'Another', 'kind', 'exactly', 'Happybirthday', 'out', 'RidaNajamKhan', 'scoreline', 'Career', 'boiiiiiiiiiiiii', 'based', 'starting', 'Test', 'omnipresent', 'Hahaha', 'version', 'victory', 'desert', 'cowards', 'OUTDATED', 'nz', 'inspecting', 'honestly', 'wait', 'Unless', 'steadying', 'think', 'anyone', 'YER', 'rant', 'one', 'odis', 'BANTER', 'paav', 'Ug6cTFgG8U', 'aggressive', 'brought', 'workload', 'Wise', 'ca', 'Brilliant', 'twist', 'open', 'THROWS', 'bringing', 'till', 'starts', 'gives', 'wYB', 'fifty', 'SENA', 'baboon', 'punishment', 'summarized', 'feeling', 'pandya', 'Bangladesh', 'hurting', 'accent', 'Kid', 'well']

The expected result is three distinct clusters with unique values that I can classify as batting, bowling, and fielding based on their elements. Currently I get three identical clusters whose elements differ only in order.

print(clusterlists[0])
#sample reduced result
['absence', 'zize6kysq2', 'flexibility', 'finally', 'finals', 'fined', 'finisher', 'firepower', 'fit', 'fitness', 'flaw', 'flaws', 'fleming', 'fluffed', 'frame', 'fluke', 'fn0uegxgss', 'focussed', 'foot', 'forget', 'forgot', 'form', 'format', 'forward', 'fought', 'fow', 'finale', 'final', 'filter', 'figures', 'fashioned', 'fast', 'fastest', 'fat', 'fatigue', 'fault', 'fav', 'featured', 'feel', 'feeling', 'feels', 'fees', 'feet', 'felt', 'ferguson', 'fewest', 'ffc4pfbvfr', 'ffs', 'field', 'fielder', 'fielders', 'fielding', 'fight', 'fow_hundreds', 'frankly', 'faridabad', 'given', 'giving', 'glad', 'glenn', 'gloves', 'god', 'gods', 'goes', 'going', 'gois', 'gon', 'gone', 'good', 'got', 'grand', 'grandhomme', 'grandmom', 'grandpa', 'grass', 'great', 'greatest', 'greatness', 'greig', 'grind', 'gives', 'gingers', 'free', 'gill', 'frontline','fulfilling', 'future', 'gaandu', 'gabbar', 'gajal_dalmia', 'gambhir', 'game', 'gangsta', 'geez', 'gem', 'genius', 'genuinely', 'gets', 'getter', 'getting', 'giant', 'giddy', 'fascinating', 'fared', 'groupby', 'drives', 'dropped', 'drowning', 'dube', 'dude', 'dumb', 'dumbass', 'duo', 'e3cli7hakf', 'e9fhdkxvvl', 'earlier', 'early', 'earned', 'easiest', 'easily', 'easy', 'economically', 'economy', 'edengarden', 'edge']
len(clusterlists[0])
#1728
len(clusterlists[1])
#1728
len(clusterlists[2])
#1728

All three lengths are currently the same. Please suggest a solution. Thanks in advance.

Link of initial finaldatatext list converted to csv.

Check where "clusterlists.append(dummy_list)" sits. If it ends up outside the "for i" loop, it runs only once, at the end of the code. It should be indented inside the outer loop so that it runs once per cluster.

Indentation is also easy to break when copying and pasting, so check it again after pasting the code.
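
A note on why every list has the same 1728 entries: argsort() on a centroid row returns an index for every term in the vocabulary, so looping over the whole of order_centroids[i] lists all terms for each cluster, only in a different order. A minimal sketch, assuming a tiny hand-made centroid matrix and vocabulary (both hypothetical, not the asker's data), shows the effect and how keeping only the top few terms makes the lists differ:

```python
import numpy as np

# Hypothetical stand-ins for model.cluster_centers_ and the vectorizer vocabulary.
centers = np.array([
    [0.9, 0.1, 0.0, 0.2],
    [0.0, 0.8, 0.3, 0.1],
    [0.1, 0.0, 0.7, 0.6],
])
terms = ["bat", "bowl", "catch", "field"]

# Column indices of each row, sorted by descending weight.
order_centroids = centers.argsort()[:, ::-1]

# Looping over the full row yields every term, just reordered.
full_lists = [[terms[ind] for ind in row] for row in order_centroids]

# Keeping only the top_n highest-weighted terms yields lists that differ.
top_n = 2
top_lists = [[terms[ind] for ind in row[:top_n]] for row in order_centroids]

print(full_lists)
print(top_lists)
```

In the asker's code this would correspond to looping over order_centroids[i, :top_n] rather than the whole row. The value of top_n is a free choice.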

I recently tested some code for clustering text. Computing distances between pieces of text is somewhat unorthodox, but it can be done if you really want to.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

documents = ["This little kitty came to play when I was eating at a restaurant.",
             "Merley has the best squooshy kitten belly.",
             "Google Translate app is incredible.",
             "If you open 100 tab in google you get a smiley face.",
             "Best cat photo I've ever taken.",
             "Climbing ninja cat.",
             "Impressed with google map feedback.",
             "Key promoter extension for Google Chrome."]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

true_k = 8
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=1000, n_init=1)
model.fit(X)

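print("Top terms per cluster:")
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
terms = vectorizer.get_feature_names_out()
for i in range(true_k):
    print("Cluster %d:" % i, end='')
    # Only the 10 highest-weighted terms of each centroid are printed
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()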

print("\n")
print("Prediction")

Y = vectorizer.transform(["chrome browser to open."])
prediction = model.predict(Y)
print(prediction)

Y = vectorizer.transform(["My cat is hungry."])
prediction = model.predict(Y)
print(prediction)

Just modify that to suit your specific needs.
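
One such modification, sketched under assumptions: if the goal is to split the input words themselves into three groups, the fitted model's labels_ attribute holds one cluster index per input item, and the items can be grouped by that index. The words and labels below are hypothetical stand-ins. In the real code, labels would be model.labels_ and words would be finaldatatext.

```python
from collections import defaultdict


def group_by_label(items, labels):
    """Return {cluster index: [items assigned to that cluster]}."""
    groups = defaultdict(list)
    for item, label in zip(items, labels):
        groups[int(label)].append(item)
    return dict(groups)


# Hypothetical data: in practice use finaldatatext and model.labels_.
words = ["six", "fifty", "fielders", "THROWS", "jamieson"]
labels = [0, 0, 2, 2, 1]

clusters = group_by_label(words, labels)
for label in sorted(clusters):
    print(label, clusters[label])
```

Each item then appears in exactly one group. Which group corresponds to batting, bowling, or fielding still has to be judged by inspection, since K-Means does not name its clusters.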


 