How to perform a multivariate linear regression when y is an indicator matrix in R?

This is my first time posting a question, so I hope it is not confusing. Thank you very much for your time.

I am working on the zipcode dataset, which can be downloaded here: http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/zip.train.gz and http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/zip.test.gz

My goal is to fit a principal component regression model with the top 3 PCs on the training set, restricted to the observations whose response is the handwritten digit 2, 3, 5, or 8, and then predict on the test data. My main problem is that after performing PCA on the X matrix, I am not sure whether I did the regression part correctly. I turned the response variable into a 2487*4 indicator matrix and want to fit a multivariate linear regression model. But the predictions are not binary indicators, so I am confused about how to map them back to the original response, i.e., which observations are predicted as 2, 3, 5, or 8. Or did I do the regression part completely wrong? Here is my code:

First, I built the subsets whose response is equal to 2, 3, 5, or 8:

zip_train <- read.table(gzfile("zip.train.gz")) 
zip_test <- read.table(gzfile("zip.test.gz"))
train <- data.frame(zip_train)
train_sub <- train[which(train$V1 == 2 | train$V1 == 3 | train$V1 == 5 | train$V1 == 8),]
test <- data.frame(zip_test)
test_sub <- test[which(test$V1 == 2 | test$V1 == 3 | test$V1 == 5 | test$V1 == 8),]    
xtrain <- train_sub[,-1]
xtest <- test_sub[,-1]
ytrain <- train_sub$V1
ytest <- test_sub$V1

Second, I centered the X matrix and calculated the top 3 principal components using svd:

cxtrain <- scale(xtrain)
svd.xtrain <- svd(cxtrain)
cxtest <- scale(xtest)
svd.xtest <- svd(cxtest)

utrain.r3 <- svd.xtrain$u[,c(1:3)] # this is the u_r
vtrain.r3 <- svd.xtrain$v[,c(1:3)] # this is the v_r
dtrain.r3 <- svd.xtrain$d[c(1:3)]
Dtrain.r3 <- diag(x=dtrain.r3,ncol=3,nrow=3) # create the diagonal matrix D with r=3
ztrain.r3 <- cxtrain %*% vtrain.r3 # this is the scores, the new components

utest.r3 <- svd.xtest$u[,c(1:3)] 
vtest.r3 <- svd.xtest$v[,c(1:3)] 
dtest.r3 <- svd.xtest$d[c(1:3)]
Dtest.r3 <- diag(x=dtest.r3,ncol=3,nrow=3) 
ztest.r3 <- cxtest %*% vtest.r3 

Third, and this is the part I am not sure I did correctly, I turned the response variable into an indicator matrix and performed a multivariate linear regression like this:

ytrain.ind <-cbind(I(ytrain==2)*1,I(ytrain==3)*1,I(ytrain==5)*1,I(ytrain==8)*1)
ytest.ind <- cbind(I(ytest==2)*1,I(ytest==3)*1,I(ytest==5)*1,I(ytest==8)*1)

mydata <- data.frame(cbind(ztrain.r3,ytrain.ind))
model_train <- lm(cbind(X4,X5,X6,X7)~X1+X2+X3,data=mydata)
new <- data.frame(ztest.r3)
pred <- predict(model_train,newdata=new)

However, pred is not an indicator matrix, so I am lost as to how to map the predictions back to digits and compare them with the real test labels to calculate the prediction error.

I finally figured out how to perform multivariate linear regression with a categorical y. First, we turn y into an indicator matrix, so that the 0s and 1s in this matrix can be interpreted as probabilities. Then we regress y on x to build a linear model, and finally use this model to predict on the test set of x. The result is a matrix with the same dimensions as our indicator matrix. Its entries should also be interpreted as probabilities, although they can be larger than 1 or smaller than 0 (which is what confused me before). So we find the maximum in each row to see which class has the highest predicted probability, and that class is our final prediction. In this way we convert the continuous numbers back into categories and can then tabulate them against the test set of y. I updated my previous code as follows.

First, I built the subsets whose response is equal to 2, 3, 5, or 8 (this code is the same as in my question):

zip_train <- read.table(gzfile("zip.train.gz")) 
zip_test <- read.table(gzfile("zip.test.gz"))
train <- data.frame(zip_train)
train_sub <- train[which(train$V1 == 2 | train$V1 == 3 | train$V1 == 5 | train$V1 == 8),]
test <- data.frame(zip_test)
test_sub <- test[which(test$V1 == 2 | test$V1 == 3 | test$V1 == 5 | test$V1 == 8),]    
xtrain <- train_sub[,-1]
xtest <- test_sub[,-1]
ytrain <- train_sub$V1
ytest <- test_sub$V1

Second, I centered the X matrix and calculated the top 3 principal components using eigen(). I updated this part of the code because my previous code standardized x instead of only centering it, which gave the wrong covariance matrix of x and the wrong eigenvectors of cov(x).

cxtrain <- scale(xtrain, center = TRUE, scale = FALSE)
eigenxtrain <- eigen(t(cxtrain) %*% cxtrain / (nrow(cxtrain) -1)) # same as eigen(cov(xtrain)), because x has already been centered
cxtest <- scale(xtest, center = TRUE, scale = FALSE)
eigenxtest <- eigen(t(cxtest) %*% cxtest/ (nrow(cxtest) -1))
r=3 # set r=3 to get the top 3 principal components
vtrain <- eigenxtrain$vectors[,c(1:r)]
ztrain <- scale(xtrain) %*% vtrain # these are the scores, the new components (note: scale() with default arguments also divides each column by its standard deviation)
vtest <- eigenxtrain$vectors[,c(1:r)] # the training eigenvectors are reused for the test set
ztest <- scale(xtest) %*% vtest
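The code above computes the eigenvectors from the centered training data but then multiplies them by scale(xtrain) and scale(xtest), which standardize each set with its own means and standard deviations. A minimal sketch of a more consistent projection is given below, assuming we want to center both sets with the training means and reuse the training eigenvectors. The toy matrices and the names x_train, x_test, mu and v are hypothetical and only stand in for the zip data; this is not the code that produced the error rates reported in this post.

```r
# Toy data standing in for xtrain / xtest (hypothetical, not the zip data)
set.seed(1)
x_train <- matrix(rnorm(200 * 5), nrow = 200)
x_test  <- matrix(rnorm(50 * 5),  nrow = 50)
r <- 3

# Center the training data and keep its column means
mu <- colMeans(x_train)
cx_train <- sweep(x_train, 2, mu)

# Top r eigenvectors of the training covariance matrix
v <- eigen(cov(x_train))$vectors[, 1:r]

# Scores: center BOTH sets with the training means, project on the same eigenvectors
z_train <- cx_train %*% v
z_test  <- sweep(x_test, 2, mu) %*% v

# Cross-check against prcomp(); columns may differ by sign only
pc <- prcomp(x_train, center = TRUE, scale. = FALSE)
max_diff <- max(abs(abs(z_train) - abs(pc$x[, 1:r])))
```

Centering the test set with the training means keeps the test scores in the same coordinate system as the scores the regression was fitted on.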

Third, I turned the response variable into an indicator matrix and performed a multivariate linear regression on the training set. I then used this linear model to predict.

ytrain.ind <- cbind(I(ytrain==2)*1,I(ytrain==3)*1,I(ytrain==5)*1,I(ytrain==8)*1)
ytest.ind <- cbind(I(ytest==2)*1,I(ytest==3)*1,I(ytest==5)*1,I(ytest==8)*1)

mydata <- data.frame(cbind(ztrain,ytrain.ind))
model_train <- lm(cbind(X4,X5,X6,X7)~X1+X2+X3,data=mydata)
new <- data.frame(ztest)
pred<- predict(model_train,newdata=new)
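The same idea can be tried on a small simulated dataset. The sketch below is only an illustration under assumed toy data (the names z, y_ind, dat and the class centers are hypothetical). It also shows why the entries behave like probabilities: because the model has an intercept and each row of the indicator matrix sums to 1, each row of the predictions sums to 1 as well, even though single entries can fall outside [0, 1].

```r
set.seed(42)
n <- 300
classes <- c(2, 3, 5, 8)
y <- sample(classes, n, replace = TRUE)

# Two toy features whose means depend on the class (hypothetical data)
centers <- rbind(c(-2, -2), c(-2, 2), c(2, -2), c(2, 2))
z <- centers[match(y, classes), ] + matrix(rnorm(n * 2), nrow = n)
dat <- data.frame(z1 = z[, 1], z2 = z[, 2])

# n x 4 indicator matrix: entry (i, k) is 1 when y[i] equals classes[k]
y_ind <- outer(y, classes, "==") * 1
colnames(y_ind) <- paste0("d", classes)

# A matrix response makes lm() fit one linear regression per column
fit <- lm(y_ind ~ z1 + z2, data = dat)
pred <- predict(fit, newdata = dat)

row_sums <- rowSums(pred)  # every row sums to 1
range(pred)                # single entries may be < 0 or > 1
```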

pred is a matrix whose entries are all probabilities, so we need to convert it back into a vector of categorical y.

pred.ind <- matrix(rep(0,690*4),nrow=690,ncol=4) # build a matrix with the same dimensions as pred, and all the entries are 0.
for (i in 1:690){
  j=which.max(pred[i,]) # j is the column number of the highest probability per row
  pred.ind[i,j]=1 # we set 1 to the columns with highest probability per row, in this way, we could turn our pred matrix back into an indicator matrix
}

pred.col1=as.matrix(pred.ind[,1]*2) # first column are those predicted as digit 2
pred.col2=as.matrix(pred.ind[,2]*3)
pred.col3=as.matrix(pred.ind[,3]*5)
pred.col4=as.matrix(pred.ind[,4]*8)
pred.col5 <- cbind(pred.col1,pred.col2,pred.col3,pred.col4) 

pred.list <- NULL
for (i in 1:690){
  pred.list[i]=max(pred.col5[i,])
} # In this way, we could finally get a list with categorical y

tt=table(pred.list,ytest)
err=(sum(tt)-sum(diag(tt)))/sum(tt) # error rate was 0.3289855
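The two loops above can also be written without loops. The sketch below is one possible shorter form, shown on a small hypothetical matrix pred_toy and label vector y_toy rather than on the real pred and ytest: it takes the column index of the row maximum and uses it to look up the digit.

```r
classes <- c(2, 3, 5, 8)

# Hypothetical stand-in for pred: one row per observation, one column per class
pred_toy <- rbind(c(0.9, 0.1, -0.1,  0.1),
                  c(0.2, 0.5,  0.2,  0.1),
                  c(0.1, 0.1,  1.1, -0.3),
                  c(0.0, 0.3,  0.3,  0.4),
                  c(0.6, 0.2,  0.1,  0.1))
y_toy <- c(2, 3, 5, 5, 2)  # hypothetical true labels

# Column index of the largest entry per row, mapped back to the digit
pred_label <- classes[apply(pred_toy, 1, which.max)]

# Confusion table and error rate
tt <- table(predicted = pred_label, actual = y_toy)
err <- mean(pred_label != y_toy)
```

Computing the error as mean(pred_label != y_toy) does not depend on the table being square, whereas sum(diag(tt)) only counts the correct predictions when the predicted and actual labels contain exactly the same set of classes.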

For the third part, we could also perform a multinomial logistic regression instead. In that case we don't need to convert y into an indicator matrix; we just convert it to a factor. The code looks like this:

library(nnet)
trainmodel <- data.frame(cbind(ztrain, ytrain))
mul <- multinom(factor(ytrain) ~., data=trainmodel) 
new <- as.matrix(ztest)
colnames(new) <- colnames(trainmodel)[1:r]
predict<- predict(mul,new)
tt=table(predict,ytest)
err=(sum(tt)-sum(diag(tt)))/sum(tt) # error rate was 0.2627907
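For comparison with the linear model, multinom() can also return class probabilities that always lie in [0, 1]. The sketch below runs on simulated data with hypothetical names (dat, idx, probs, label) and assumes the nnet package is installed; it is only meant to show the two predict() types, not to reproduce the numbers above.

```r
library(nnet)

set.seed(7)
n <- 300
classes <- c(2, 3, 5, 8)
y <- sample(classes, n, replace = TRUE)

# Two toy features whose means depend on the class (hypothetical data)
centers <- rbind(c(-1.5, -1.5), c(-1.5, 1.5), c(1.5, -1.5), c(1.5, 1.5))
z <- centers[match(y, classes), ] + matrix(rnorm(n * 2), nrow = n)
dat <- data.frame(z1 = z[, 1], z2 = z[, 2], y = factor(y))

idx <- 1:200  # first 200 rows for training, the rest for testing
fit <- multinom(y ~ z1 + z2, data = dat[idx, ], trace = FALSE)

# type = "probs" gives one probability per class, type = "class" the most likely class
probs <- predict(fit, newdata = dat[-idx, ], type = "probs")
label <- predict(fit, newdata = dat[-idx, ], type = "class")

err <- mean(as.character(label) != as.character(dat$y[-idx]))
```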

So the logistic model does perform better than the linear model here.
