简体   繁体   中英

R Using Purrr Map function to calculate Silhouette distances of KMeans model

Simple KMeans model

Data

library(tidyverse)
library(broom)
library(cluster)
set.seed(42)
data = tibble(x = c(rnorm(20, 5, 1), rnorm(20, 10, 3)),
              y = c(rnorm(20, 5, 1), rnorm(20, 10, 3)))

Model

kmeans_k2 <- kmeans(data, 2, 9)

Tidymodel Results

tidy(kmeans_k2)
# A tibble: 2 x 5
      x     y  size withinss cluster
  <dbl> <dbl> <int>    <dbl> <fct>  
1  9.60 10.8     18    255.  1      
2  5.22  5.20    22     76.6 2      
glance(kmeans_k2)
# A tibble: 1 x 4
  totss tot.withinss betweenss  iter
  <dbl>        <dbl>     <dbl> <int>
1  836.         331.      505.     1
augment(kmeans_k2, data)
# A tibble: 40 x 3
       x     y .cluster
   <dbl> <dbl> <fct>   
 1  6.37  5.21 2       
 2  4.44  4.64 2       
 3  5.36  5.76 2       
 4  5.63  4.27 2       
 5  5.40  3.63 2       
 6  4.89  5.43 2       
 7  6.51  4.19 2       
 8  4.91  6.44 2       
 9  7.02  4.57 2       
10  4.94  5.66 2       
# ... with 30 more rows

Silhouette Calculation

sil_k2 <- silhouette(kmeans_k2$cluster, dist(data))
summary(sil_k2)
Silhouette of 40 units in 2 clusters from silhouette.default(x = kmeans_k2$cluster, dist = dist(data)) :
 Cluster sizes and average silhouette widths:
       18        22 
0.3527383 0.7053056 
Individual silhouette widths:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.04243  0.45422  0.58233  0.54665  0.74472  0.79591 
summary(sil_k2)$si.summary[4]
     Mean 
0.5466503 

Nested tibble using Purrr's map function to create results for multiple different k's

kmeans_k123 <- tibble(k = 1:3) %>%
  mutate(km_model = map(k, ~kmeans(data, .x)),
         tidydata = map(km_model, tidy),
         glancedata = map(km_model, glance),
         augmentdata = map(km_model, augment, data))
kmeans_k123
# A tibble: 3 x 5
      k km_model tidydata         glancedata       augmentdata      
  <int> <list>   <list>           <list>           <list>           
1     1 <kmeans> <tibble [1 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>
2     2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>
3     3 <kmeans> <tibble [3 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>
pluck(kmeans_k23, 3, 2)
# A tibble: 2 x 5
      x     y  size withinss cluster
  <dbl> <dbl> <int>    <dbl> <fct>  
1  5.63  5.10    26     117. 1      
2 11.3  10.7     14     176. 2      

Question is, How do I add the Silhouette score to the nested tibble? The Silhouette function needs the clusters of each model and I'm not sure how to do that. Obviously I can pluck out a single instance, such as

data_k2cluster <- pluck(kmeans_k123, 2, 2)$cluster
data_k2cluster
 [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1
sil_k2v2 <- silhouette(data_k2cluster, dist(data))
summary(sil_k2v2)
Silhouette of 40 units in 2 clusters from silhouette.default(x = data_k2cluster, dist = dist(data)) :
 Cluster sizes and average silhouette widths:
       18        22 
0.3527383 0.7053056 
Individual silhouette widths:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.04243  0.45422  0.58233  0.54665  0.74472  0.79591 

But when I try and use that with map it doesn't work

kmeans_k123 %>% mutate(sildata = map2(km_model$cluster, data, silhouette))
Error: Problem with `mutate()` input `sildata`.
x Input `sildata` can't be recycled to size 3.
i Input `sildata` is `map2(km_model$cluster, data, silhouette)`.
i Input `sildata` must be size 3 or 1, not 0.

I can create a function, which again works for a single occurence

my_fn <- function(f_cluster, f_data){my_fn <- silhouette(f_cluster, dist(f_data))}
summary(my_fn(kmeans_k2$cluster, data))
Silhouette of 40 units in 2 clusters from silhouette.default(x = f_cluster, dist = dist(f_data)) :
 Cluster sizes and average silhouette widths:
       18        22 
0.3527383 0.7053056 
Individual silhouette widths:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.04243  0.45422  0.58233  0.54665  0.74472  0.79591 

But fails when I use it with map.

kmeans_k123 %>% mutate(sildata = map2(km_model$cluster, data, my_fn))
Error: Problem with `mutate()` input `sildata`.
x Input `sildata` can't be recycled to size 3.
i Input `sildata` is `map2(km_model$cluster, data, my_fn)`.
i Input `sildata` must be size 3 or 1, not 0.

I suspect that the problem is related to how I am trying to retrieve $cluster from the nested models as I've tried extracting that to create it's own column, but can't get that to work either.

Putting it as an answer, because comments don't really allow for a ton of code.

The following works for me:

kmeans_k123 <- tibble(k = 1:3) %>%
  mutate(km_model = map(k, ~kmeans(data, .x)),
         tidydata = map(km_model, tidy),
         glancedata = map(km_model, glance),
         augmentdata = map(km_model, augment, data),
         silhouettedata = map(augmentdata, ~ silhouette(as.numeric(levels(.x$.cluster))[.x$.cluster], dist(data))))

unnest(kmeans_k123, silhouettedata)

# A tibble: 81 x 6
       k km_model tidydata         glancedata       augmentdata       silhouettedata[,"cluster"] [,"neighbor"] [,"sil_width"]
   <int> <list>   <list>           <list>           <list>                                 <dbl>         <dbl>          <dbl>
 1     1 <kmeans> <tibble [1 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>                         NA            NA         NA    
 2     2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>                          2             1          0.743
 3     2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>                          2             1          0.776
 4     2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>                          2             1          0.772
 5     2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>                          2             1          0.771
 6     2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>                          2             1          0.742
 7     2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>                          2             1          0.794
 8     2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>                          2             1          0.723
 9     2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>                          2             1          0.713
10     2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>                          2             1          0.683

About the usage of as.numeric(levels(.x$.cluster))[.x$.cluster] , this is because broom::tidy() turns the cluster variable into a factor, and cluster::silhouette() requires the cluster variable to be numeric. This answer provides why you use that particular line of code to convert a factor of numbers into a numeric variable.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM