
PCA clustering and analysis of clusters in R

I am trying to perform a PCA on a dataset that contains survey results. The survey was conducted on companies (one company per row), which were asked multiple questions (questions and answers are in columns). Most of the questions followed the pattern "Please choose an answer from a set X of answers, X = {1, 2, 3, 4, ...}". There are some boolean values, but a good share of the answers have more variation.

What I would like to do is to reduce the dimensions and look for the similarities among the companies. For this purpose I would like to perform a PCA.

The dataset I will be using can be downloaded from: https://www.kaggle.com/jakubdbrowski/datapca

library(factoextra)  # provides fviz_nbclust, eclust and fviz_pca_var used below
datapca <- read.csv2("datapca.csv")
datapca <- datapca[, -1]

I need to drop the first column which does not have any information. The dataset was cleaned and prepared beforehand. Now I can perform a PCA.

xxx.pca <- prcomp(datapca, center = TRUE, scale.= TRUE)

Now I would like to check how many clusters my data could support.

fviz_nbclust(xxx.pca$x, FUNcluster=kmeans, k.max = 8)

It looks like it could be difficult to find clusters in this particular dataset.

hopkins(datapca, n=nrow(xxx.pca$x)-1) 

However, I would like to continue so that I go through the whole analytical process. Once I receive the updated data, the results may be better.

So I will create two clusters as suggested.

km1 <- eclust(xxx.pca$x, "kmeans", hc_metric = "euclidean", k = 2)

This is where my question comes in. I would like to look at the clusters, determine which loadings are responsible for the clustering, and characterize the two clusters.

I would also like to ask whether it is possible to determine the most important loadings, reduce their number (right now there are 150, which makes the graph too complicated), and plot them more clearly. Both graphs below are too messy.

fviz_pca_var(xxx.pca, col.var = "black")
biplot(xxx.pca, showLoadings = TRUE, lab = NULL)

Thank you very much in advance!

The first 2 PCs explain about 23% of the variation in the data, the first 13 explain about 50%, and the first 26 explain about 66%. You need to decide how many components are meaningful.

xxx.comp <- summary(xxx.pca)
xxx.comp$importance[, c(2, 13, 26)]
#                             PC2     PC13     PC26
# Standard deviation     3.679506 1.527898 1.225693
# Proportion of Variance 0.090260 0.015560 0.010020
# Cumulative Proportion  0.233080 0.497630 0.658610
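
If it helps to find such cut-offs without reading the table by eye, here is a minimal sketch. It assumes a cumulative-variance threshold is an acceptable rule for choosing components, which is only one of several possible criteria. The built-in mtcars data stands in for datapca so the code runs on its own, and n_components_for is a hypothetical helper name, not part of any package.

```r
# Stand-in data: built-in mtcars replaces datapca so the sketch is self-contained.
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Proportion of variance per component, computed from the component
# standard deviations (the same numbers summary() reports).
var_prop <- pca$sdev^2 / sum(pca$sdev^2)
cum_prop <- cumsum(var_prop)

# Hypothetical helper: smallest number of leading PCs whose cumulative
# share of variance reaches `target`.
n_components_for <- function(cum_prop, target) {
  which(cum_prop >= target)[1]
}

n50 <- n_components_for(cum_prop, 0.50)
n66 <- n_components_for(cum_prop, 0.66)
```

With your own object, replace pca with xxx.pca; the thresholds 0.50 and 0.66 simply mirror the figures quoted above.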

Plotting the first two components shows some clustering:

plot(xxx.pca$x[, 1:2], pch=20)

[Figure: scatter plot of the first two principal components]

You can start by identifying clusters in 2 dimensions, see if they make sense and then increase the number of dimensions.
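
As a rough sketch of that workflow, using base R only: cluster on the first 2 components, repeat with more components, compare the assignments, then look at which variables load most strongly and how the clusters differ on the original variables. Again mtcars stands in for datapca, the choice of 2 and 4 dimensions is arbitrary, and cluster_on_pcs and top_loadings are hypothetical helper names introduced for illustration.

```r
# Stand-in data (mtcars instead of datapca); swap in your own prcomp object.
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)

set.seed(1)  # k-means starts from random centers

# Hypothetical helper: k-means on the scores of the first `d` components.
cluster_on_pcs <- function(pca, d, k = 2) {
  kmeans(pca$x[, 1:d, drop = FALSE], centers = k, nstart = 25)
}

km2 <- cluster_on_pcs(pca, d = 2)
km4 <- cluster_on_pcs(pca, d = 4)

# Cross-tabulate the two solutions. If the clusters are stable, each row has
# most of its counts in a single column (cluster labels may be swapped).
agreement <- table(d2 = km2$cluster, d4 = km4$cluster)

# Hypothetical helper: the `n` variables with the largest absolute loadings
# on one component.
top_loadings <- function(pca, pc, n = 5) {
  l <- pca$rotation[, pc]
  l[order(abs(l), decreasing = TRUE)][seq_len(min(n, length(l)))]
}
top_pc1 <- top_loadings(pca, 1)

# Mean of each standardized original variable per cluster, to describe
# how the two groups differ.
cluster_profile <- aggregate(as.data.frame(scale(mtcars)),
                             by = list(cluster = km2$cluster), FUN = mean)
```

To see the result, plot(pca$x[, 1:2], col = km2$cluster, pch = 20) colors the points from the earlier scatter plot by cluster. Restricting attention to the top few loadings per component is one simple way to cut 150 variables down to something readable, though it ignores variables that matter only on later components.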
