
Principal Component Analysis throws constant/zero column Error

I'm trying to run a PCA on the "training1" data set created below:

library(AppliedPredictiveModeling); data(AlzheimerDisease); library(caret)

adData <- data.frame(diagnosis, predictors)
inTrain <- createDataPartition(y = adData$diagnosis, p = .75)[[1]]
training <- adData[inTrain, ]
keep <- subset(data.frame(x = substr(as.character(colnames(training)), 1, 2), y = c(1:ncol(training))), x == "IL")
training1 <- cbind(training[, c(keep[1, 2]:keep[nrow(keep), 2])], training[c("diagnosis")])

Then, when I run the following function:

preProc <- preProcess(log10(training1[, -13]+1), method = "pca", pcaComp = 2)

I get the following error:

Warning in preProcess.default(log10(training1[, -13] + 1), method = "pca",  :
  Std. deviations could not be computed for: IL_1alpha, IL_3
Error in prcomp.default(x[, method$pca, drop = FALSE], scale = TRUE, retx = FALSE) : 
  cannot rescale a constant/zero column to unit variance

However, I then ran the following two calls to show that standard deviations can be calculated for the two variables the warning names:

sd(training1$IL_1alpha)
[1] 0.4056147
sd(training1$IL_3)
[1] 0.5235212

I then ran the following to show that none of my variables has zero variance:

nsv <- nearZeroVar(training1, saveMetrics = TRUE)
print(nsv)
              freqRatio percentUnique zeroVar   nzv
IL_11          1.250000    29.4820717   FALSE FALSE
IL_13          1.052632     6.7729084   FALSE FALSE
IL_16          1.117647    21.9123506   FALSE FALSE
IL_17E         1.238095    16.7330677   FALSE FALSE
IL_1alpha      1.208333    23.1075697   FALSE FALSE
IL_3           1.066667    24.7011952   FALSE FALSE
IL_4           1.315789    19.1235060   FALSE FALSE
IL_5           1.000000    19.5219124   FALSE FALSE
IL_6           1.000000    20.3187251   FALSE FALSE
IL_6_Receptor  1.041667    21.5139442   FALSE FALSE
IL_7           1.611111    18.7250996   FALSE FALSE
IL_8           1.000000    22.3107570   FALSE FALSE
diagnosis      2.637681     0.7968127   FALSE FALSE

It seems like other people's issues with PCA in R were around zero variance columns, but since I can prove that I don't have that issue here, any ideas what may be causing the issue?

Sorry, I don't have the rep to comment, so I'm posting this as an answer. After running your code, I found that this expression in particular:

 log10(training1[, -13]+1)    

returns NaN values in some columns (IL_1alpha and IL_3, in fact):

 Warning messages:
 1: In lapply(X = x, FUN = .Generic, ...) : NaNs produced
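One way to see which columns are affected is to count the NaN values per column after the transform. The sketch below uses a made-up data frame (`toy`, with hypothetical columns `a` and `b`) in place of `training1[, -13]`, so it runs without the Alzheimer data; the same check should work on the real data frame.

```r
# Toy data standing in for training1[, -13]; names and values are made up.
toy <- data.frame(
  a = c(0.5, 1.2, 2.0, 3.1),    # all values > -1, so log10(x + 1) is defined
  b = c(-2.5, -1.5, 0.3, 1.0)   # two values < -1, so x + 1 is negative there
)

# Same transform as in the question; the "NaNs produced" warning is expected.
transformed <- suppressWarnings(log10(toy + 1))

# Number of NaN entries in each transformed column
nan_counts <- colSums(is.nan(as.matrix(transformed)))
print(nan_counts)

# sd() works on the raw column but returns NA on the transformed one
print(sd(toy$b))
print(sd(transformed$b))
```

Any column with a nonzero count here is one for which the standard deviation can no longer be computed after the transform.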

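So that seems to be the source of the error: log10(x + 1) is undefined for values below -1, so those columns end up containing NaN, their standard deviations cannot be computed, and presumably that is why prcomp then fails to rescale them. Note that the sd() checks in the question were run on the raw columns, not on the log-transformed ones. Rather than taking logs of negative numbers, consider a different transformation, or whether a transformation is necessary at all.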
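As a rough illustration of the "no transformation" option, the sketch below runs a centered and scaled PCA directly on data that contains negative values. It uses base R's `prcomp` on a made-up data frame (`toy`) and is not the original poster's pipeline. With caret, the analogous step would presumably be to call `preProcess(training1[, -13], method = "pca", pcaComp = 2)` without the `log10` wrapper, since caret centers and scales the data when PCA is requested.

```r
# Made-up data with negative values, standing in for the IL_* predictors.
set.seed(1)
toy <- data.frame(a = rnorm(20), b = rnorm(20), c = rnorm(20))

# Center and scale inside prcomp; no log transform, so no NaN is introduced.
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Keep the first two principal components, as pcaComp = 2 would.
scores <- pca$x[, 1:2]
print(dim(scores))
print(anyNA(scores))
```

Whether skipping the log is appropriate depends on the distribution of the predictors; if a variance-stabilizing transform is still wanted, one that accepts negative inputs would be needed.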
