
Number of distinct clusters in KMeans is less than n_clusters?

I have some food images stored in a single folder. All the images are unlabeled, and they are not sorted into separate folders such as "pasta" or "meat". My current goal is to cluster the images into a number of categories so that I can later assess whether the taste of the foods depicted in images of the same cluster is similar.

To do that, I load the images and process them into a format that can be fed into VGG16 for feature extraction, and then pass the features to KMeans to cluster the images. The code I am using is:

import os
import glob

import numpy as np
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.preprocessing import image
from sklearn.cluster import KMeans

path = r'C:\Users\Hi\Documents\folder'
train_dir = os.path.join(path)
model = VGG16(weights='imagenet', include_top=False)

# load and preprocess each image, then extract its VGG16 features
vgg16_feature_list = []
files = glob.glob(r'C:\Users\Hi\Documents\folder\*.jpg')
for img_path in files:
    img = image.load_img(img_path, target_size=(224, 224))
    img_data = image.img_to_array(img)
    img_data = np.expand_dims(img_data, axis=0)
    img_data = preprocess_input(img_data)

    vgg16_feature = model.predict(img_data)
    vgg16_feature_np = np.array(vgg16_feature)
    vgg16_feature_list.append(vgg16_feature_np.flatten())

vgg16_feature_list_np = np.array(vgg16_feature_list)
print(vgg16_feature_list_np.shape)
print(vgg16_feature_np.shape)

kmeans = KMeans(n_clusters=3, random_state=0).fit(vgg16_feature_list_np)
print(kmeans.labels_)

The issue is that I get the following warning:

ConvergenceWarning: Number of distinct clusters (1) found smaller than n_clusters (3). Possibly due to duplicate points in X. 

How can I fix that?

This is one of those situations where, although your code is fine from a programming point of view, it does not produce satisfactory results due to an ML-related issue (the data, the model, or both); hence it is rather difficult to "debug" (quotes intended, since the code itself runs fine and this is not the typical debugging procedure).
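Before tuning anything, it may be worth confirming whether the feature matrix really contains duplicate (or near-duplicate) rows, as the warning message suggests. A minimal diagnostic sketch, assuming the vgg16_feature_list_np array from the question:

import numpy as np

# Count how many unique feature vectors survive deduplication; if this
# is much smaller than the number of images, the features carry too
# little diversity to support 3 clusters.
n_unique = np.unique(vgg16_feature_list_np, axis=0).shape[0]
print(n_unique, "unique feature vectors out of", vgg16_feature_list_np.shape[0])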

At first glance, the situation seems to imply that there is not enough diversity in your features to justify 3 distinct clusters. And, provided that we remain in a K-means context, there is not much you can do; the few options available are listed below (refer to the documentation for details of the respective parameters):

  • Increase the number of iterations max_iter (default 300)
  • Increase the number of different centroid initializations n_init (default 10)
  • Change the init argument to random (the default is k-means++), or, even better, provide a 3-element array with one sample from each of your targeted clusters (if you already have an idea of what these clusters may actually be in your data)
  • Run the model with different random_state values
  • Combine the above (a sketch follows this list)
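A minimal sketch combining these options, assuming the vgg16_feature_list_np feature matrix from the question; the parameter values are illustrative, not tuned:

import numpy as np
from sklearn.cluster import KMeans

# Try random initialization with more restarts and more iterations,
# over a few different seeds.
for seed in (0, 1, 42):
    kmeans = KMeans(
        n_clusters=3,
        init='random',   # default is 'k-means++'
        n_init=50,       # default is 10
        max_iter=1000,   # default is 300
        random_state=seed,
    ).fit(vgg16_feature_list_np)
    print(seed, np.unique(kmeans.labels_))

If any run prints more than one distinct label, that particular configuration managed to separate the data.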

If none of the above works, it would very likely mean that K-means is simply not applicable here, and you may have to look for alternative approaches (which are out of the scope of this thread). Truth is, as correctly pointed out in the comments, K-means does not usually work well with data of such high dimensionality.
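If you do go down that road, a common first step is to reduce the dimensionality of the features before clustering. A minimal sketch, assuming scikit-learn's PCA and the feature matrix from the question:

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Project the high-dimensional VGG16 features to a lower-dimensional
# space before clustering; 50 components is an arbitrary illustrative
# choice (it must not exceed the number of images).
reduced = PCA(n_components=50, random_state=0).fit_transform(vgg16_feature_list_np)
kmeans = KMeans(n_clusters=3, random_state=0).fit(reduced)
print(kmeans.labels_)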

import warnings

with warnings.catch_warnings():
    warnings.simplefilter("ignore")   # suppress all warnings inside this block
    cluster_data(data_arr)            # your own clustering routine goes here

You can use this pattern to suppress the warnings, since scikit-learn emits them through Python's standard warnings module.
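If you prefer not to silence everything, you can target only this warning class; a sketch assuming scikit-learn's ConvergenceWarning and the feature matrix from the question:

import warnings
from sklearn.cluster import KMeans
from sklearn.exceptions import ConvergenceWarning

with warnings.catch_warnings():
    # Ignore only convergence warnings; other warnings stay visible.
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    kmeans = KMeans(n_clusters=3, random_state=0).fit(vgg16_feature_list_np)

Note that suppressing the warning only hides the message; the single-cluster result itself is unchanged.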
