
How to perform a multivariate linear regression when y is an indicator matrix in R?

This is the first time I am posting a question, so I hope it is not too confusing. Thanks very much for your time.

I am working on a zipcode dataset, which can be downloaded here: http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/zip.train.gz http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/zip.test.gz

In general, my goal is to fit a principal component regression model with the top 3 PCs on the training dataset, restricted to the observations whose response variable is one of the handwritten digits 2, 3, 5, and 8, and then predict on the test data. My main problem is that, after performing PCA on the X matrix, I am not sure whether I did the regression part correctly. I have turned the response variable into a 2487*4 indicator matrix and want to fit a multivariate linear regression model. But the predictions are not 0/1 indicators, so I am confused about how to map them back to the original response values, i.e., which observations are predicted as 2, 3, 5, or 8. Or did I do the regression part completely wrong? Here is my code:

First of all, I built the subsets in which the response variable is equal to 2, 3, 5, or 8:

zip_train <- read.table(gzfile("zip.train.gz")) 
zip_test <- read.table(gzfile("zip.test.gz"))
train <- data.frame(zip_train)
train_sub <- train[which(train$V1 == 2 | train$V1 == 3 | train$V1 == 5 | train$V1 == 8),]
test <- data.frame(zip_test)
test_sub <- test[which(test$V1 == 2 | test$V1 == 3 | test$V1 == 5 | test$V1 == 8),]    
xtrain <- train_sub[,-1]
xtest <- test_sub[,-1]
ytrain <- train_sub$V1
ytest <- test_sub$V1
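
As a quick sanity check before running PCA, the subset sizes and class counts can be inspected; the training subset should have 2487 rows (matching the 2487*4 indicator matrix mentioned above) and the test subset 690 rows:

dim(xtrain)   # 2487 observations x 256 pixel columns
dim(xtest)    # 690 observations x 256 pixel columns
table(ytrain) # counts of the digits 2, 3, 5 and 8 in the training set
table(ytest)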

Second, I centered the X matrix and calculated the top 3 principal components using svd():

cxtrain <- scale(xtrain)
svd.xtrain <- svd(cxtrain)
cxtest <- scale(xtest)
svd.xtest <- svd(cxtest)

utrain.r3 <- svd.xtrain$u[,c(1:3)] # this is the u_r
vtrain.r3 <- svd.xtrain$v[,c(1:3)] # this is the v_r
dtrain.r3 <- svd.xtrain$d[c(1:3)]
Dtrain.r3 <- diag(x=dtrain.r3,ncol=3,nrow=3) # create the diagonal matrix D with r=3
ztrain.r3 <- cxtrain %*% vtrain.r3 # these are the scores, i.e. the new components

utest.r3 <- svd.xtest$u[,c(1:3)] 
vtest.r3 <- svd.xtest$v[,c(1:3)] 
dtest.r3 <- svd.xtest$d[c(1:3)]
Dtest.r3 <- diag(x=dtest.r3,ncol=3,nrow=3) 
ztest.r3 <- cxtest %*% vtest.r3 
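
To double-check the SVD bookkeeping: since cxtrain = U D V', the score matrix cxtrain %*% vtrain.r3 should agree with utrain.r3 %*% Dtrain.r3 up to floating-point error, for example:

all.equal(ztrain.r3, utrain.r3 %*% Dtrain.r3, check.attributes = FALSE) # should return TRUE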

Third (this is the part I am not sure I did correctly), I turned the response variable into an indicator matrix and performed a multivariate linear regression like this:

ytrain.ind <- cbind(I(ytrain==2)*1,I(ytrain==3)*1,I(ytrain==5)*1,I(ytrain==8)*1)
ytest.ind <- cbind(I(ytest==2)*1,I(ytest==3)*1,I(ytest==5)*1,I(ytest==8)*1)

mydata <- data.frame(cbind(ztrain.r3,ytrain.ind))
model_train <- lm(cbind(X4,X5,X6,X7)~X1+X2+X3,data=mydata)
new <- data.frame(ztest.r3)
pred <- predict(model_train,newdata=new)

However, pred is not an indicator matrix, so I am lost as to how to map the predictions back to the digits and compare them with the actual test labels in order to calculate the prediction error.

I finally figured out how to perform multivariate linear regression with a categorical y. First we turn y into an indicator matrix, so that the 0s and 1s in this matrix can be interpreted as probabilities. Then we regress y on x to build a linear model, and finally use this model to predict with the test set of x. The result is a matrix with the same dimensions as the indicator matrix. All of its entries should also be interpreted as probabilities, although they can be larger than 1 or smaller than 0 (which is what confused me before). So for each row we find the column with the maximum value, i.e. the predicted y with the highest probability, and that y is our final prediction. In this way we convert the continuous fitted values back into categories and can then build a table to compare them with the test set of y. I updated my previous code as below.

First of all, I built the subsets in which the response variable is equal to 2, 3, 5, or 8 (the code remains the same as in my question):

zip_train <- read.table(gzfile("zip.train.gz")) 
zip_test <- read.table(gzfile("zip.test.gz"))
train <- data.frame(zip_train)
train_sub <- train[which(train$V1 == 2 | train$V1 == 3 | train$V1 == 5 | train$V1 == 8),]
test <- data.frame(zip_test)
test_sub <- test[which(test$V1 == 2 | test$V1 == 3 | test$V1 == 5 | test$V1 == 8),]    
xtrain <- train_sub[,-1]
xtest <- test_sub[,-1]
ytrain <- train_sub$V1
ytest <- test_sub$V1

Second, I centered the X matrix and calculated the top 3 principal components using eigen(). I updated this part of the code because in my previous code I standardized x instead of only centering it, which led to a wrong computation of the covariance matrix of x and of the eigenvectors of cov(x).

cxtrain <- scale(xtrain, center = TRUE, scale = FALSE) 
eigenxtrain <- eigen(t(cxtrain) %*% cxtrain / (nrow(cxtrain) - 1)) # same as eigen(cov(xtrain)), because x has already been centered
cxtest <- scale(xtest, center = TRUE, scale = FALSE)
eigenxtest <- eigen(t(cxtest) %*% cxtest / (nrow(cxtest) - 1))
r=3 # set r=3 to keep the top 3 principal components
vtrain <- eigenxtrain$vectors[,c(1:r)] 
ztrain <- cxtrain %*% vtrain # these are the scores, i.e. the new components (use the centered x, not scale(xtrain), to stay consistent with the centering-only fix above)
vtest <- eigenxtrain$vectors[,c(1:r)] # project the test data onto the training eigenvectors
ztest <- cxtest %*% vtest
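
As an optional cross-check of the eigen() computation (assuming the same centering-only setup), prcomp() with center = TRUE and scale. = FALSE should give the same loadings and scores, up to the sign of each component:

pca_train <- prcomp(xtrain, center = TRUE, scale. = FALSE)
head(pca_train$rotation[, 1:r]) # should match vtrain up to sign flips per column
head(pca_train$x[, 1:r])        # should match ztrain up to the same sign flips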

Third, I turned the response variable into an indicator matrix and performed a multivariate linear regression on the training set, and then used this linear model to predict on the test set.

ytrain.ind <- cbind(I(ytrain==2)*1,I(ytrain==3)*1,I(ytrain==5)*1,I(ytrain==8)*1)
ytest.ind <- cbind(I(ytest==2)*1,I(ytest==3)*1,I(ytest==5)*1,I(ytest==8)*1)

mydata <- data.frame(cbind(ztrain,ytrain.ind))
model_train <- lm(cbind(X4,X5,X6,X7)~X1+X2+X3,data=mydata)
new <- data.frame(ztest)
pred <- predict(model_train,newdata=new)

pred is a matrix whose entries are (unconstrained) probabilities, so we need to convert it back into a vector of categorical y values.

pred.ind <- matrix(rep(0,690*4),nrow=690,ncol=4) # a matrix with the same dimensions as pred (690 test observations), all entries 0
for (i in 1:690){
  j=which.max(pred[i,]) # j is the column index of the highest probability in row i
  pred.ind[i,j]=1 # put a 1 in that column; this turns pred back into an indicator matrix
}

pred.col1=as.matrix(pred.ind[,1]*2) # first column are those predicted as digit 2
pred.col2=as.matrix(pred.ind[,2]*3)
pred.col3=as.matrix(pred.ind[,3]*5)
pred.col4=as.matrix(pred.ind[,4]*8)
pred.col5 <- cbind(pred.col1,pred.col2,pred.col3,pred.col4) 

pred.list <- NULL
for (i in 1:690){
  pred.list[i]=max(pred.col5[i,])
} # In this way, we could finally get a list with categorical y

tt=table(pred.list,ytest)
err=(sum(tt)-sum(diag(tt)))/sum(tt) # error rate was 0.3289855
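
The decoding above can also be written more compactly; this is just an equivalent version of the same row-wise maximum idea and should reproduce the same table and error rate:

labels <- c(2, 3, 5, 8)
pred.list2 <- labels[apply(pred, 1, which.max)] # column with the largest fitted value per row, mapped back to its digit
tt2 <- table(pred.list2, ytest)
(sum(tt2) - sum(diag(tt2))) / sum(tt2)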

For the third part, we could also perform a multinomial logistic regression instead. In this case we don't need to convert y into an indicator matrix; we just treat it as a factor. The code looks as below:

library(nnet)
trainmodel <- data.frame(cbind(ztrain, ytrain))
mul <- multinom(factor(ytrain) ~., data=trainmodel) 
new <- as.matrix(ztest)
colnames(new) <- colnames(trainmodel)[1:r]
pred.mul <- predict(mul,new) # store under a name other than "predict" so the predict() function is not masked
tt=table(pred.mul,ytest)
err=(sum(tt)-sum(diag(tt)))/sum(tt) # error rate was 0.2627907
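
If class probabilities are wanted rather than only the predicted labels, predict() for a multinom fit also accepts type = "probs" (this is just an extra, not needed for the error rate above):

probs <- predict(mul, new, type = "probs") # one row per test observation, one column per digit; each row sums to 1
head(probs)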

So the multinomial logistic model does perform better than the linear model here.
