How to run PCA on an existing correlation matrix, then run regression?

I have calculated the pairwise correlation between survey respondents and stored it in a data frame. It looks like this:

          person_1  person_2  person_3
person_1       0       1.5       1.8
person_2     1.5         0       2.2
person_3     1.8       2.2         0

Now I'd like to run a PCA to find the loadings for each response. I have two questions:

  1. Which function should I use to calculate the principal components directly from the correlation matrix?
  2. On a related note, I'd then like to regress each respondent's loading on that person's survey rating score in the original data frame. Is there a way to merge the "score" column back in so I can run the regression? Or is there another way to do the regression/prediction?

The original data frame holds text and looks like the one below. I run word mover's distance between the sentences to derive the correlation matrix.

          text                       score
person_1  I like working at Apple       2
person_2  the culture is great         -2
person_3  pandemic hits                 5

Thanks!

Because you already have a matrix rather than the raw data, most of the usual PCA functions in R tend to run into tolerance issues with it and return an error. I would suggest the following approach using the eigen() function, which replicates the essence of PCA. First, the data:

#Data
#Matrix
mm <- structure(c(0, 1.5, 1.8, 1.5, 0, 2.2, 1.8, 2.2, 0), .Dim = c(3L, 
3L), .Dimnames = list(c("person_1", "person_2", "person_3"), 
    c("person_1", "person_2", "person_3")))
#Scores
df1 <- structure(list(text. = c("I like working at Apple", "the culture is great", 
"pandemic hits"), score = c(2L, -2L, 5L)), row.names = c(NA, 
-3L), class = "data.frame")

The code for the PCA is then:

#PCA
myPCA <- eigen(mm)
#Eigenvalues: the component variances (the squared sdev in princomp terms)
myPCA$values

Output:

[1]  3.681925 -1.437762 -2.244163
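
The negative eigenvalues above are worth a second look: a true correlation matrix has ones on its diagonal and no negative eigenvalues, whereas mm has a zero diagonal. The sketch below is only an illustration. It checks both properties for mm, then shows princomp() taking a matrix directly through its covmat argument, using a made-up valid correlation matrix (R_demo, which is an assumption and not part of the question).

```r
# Matrix from the question
mm <- matrix(c(0, 1.5, 1.8,
               1.5, 0, 2.2,
               1.8, 2.2, 0), nrow = 3,
             dimnames = list(paste0("person_", 1:3), paste0("person_", 1:3)))

ev <- eigen(mm, symmetric = TRUE)$values
has_unit_diag <- all(diag(mm) == 1)   # FALSE: the diagonal is all zeros
is_psd <- all(ev >= -1e-8)            # FALSE: two eigenvalues are negative

# Hypothetical valid correlation matrix, for illustration only
R_demo <- matrix(c(1, 0.5, 0.3,
                   0.5, 1, 0.4,
                   0.3, 0.4, 1), nrow = 3,
                 dimnames = list(paste0("v", 1:3), paste0("v", 1:3)))

pc_demo <- princomp(covmat = R_demo)  # PCA from the matrix, no raw data needed
pc_demo$sdev^2                        # component variances = eigenvalues of R_demo
unclass(pc_demo$loadings)             # loadings = eigenvectors (sign is arbitrary)
```

When only a matrix is supplied, princomp() cannot return scores, so the loadings and variances are all it has to offer, which is the same information eigen() gives.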

To get the loadings, we use this:

#Loadings
myPCA$vectors

Output:

          [,1]       [,2]       [,3]
[1,] -0.5360029  0.8195308 -0.2026578
[2,] -0.5831254 -0.5329938 -0.6130925
[3,] -0.6104635 -0.2104444  0.7635754
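
As a quick sanity check on what eigen() returns, the following sketch labels the matrix (rows are respondents, columns are components) and confirms that the eigenvectors are orthonormal and that they reproduce mm. The PC1..PC3 labels are an assumption added here for readability.

```r
# Matrix from the question
mm <- matrix(c(0, 1.5, 1.8,
               1.5, 0, 2.2,
               1.8, 2.2, 0), nrow = 3,
             dimnames = list(paste0("person_", 1:3), paste0("person_", 1:3)))

myPCA <- eigen(mm)
V <- myPCA$vectors
rownames(V) <- rownames(mm)                     # rows: respondents
colnames(V) <- paste0("PC", seq_len(ncol(V)))   # columns: components (hypothetical labels)

# Eigenvectors are orthonormal, so t(V) %*% V is the identity matrix
orthonormal <- isTRUE(all.equal(crossprod(V), diag(3), check.attributes = FALSE))

# The decomposition reproduces the input: mm = V diag(lambda) t(V)
rebuilt <- V %*% diag(myPCA$values) %*% t(V)
reconstructs <- isTRUE(all.equal(rebuilt, mm, check.attributes = FALSE))

V
```

The sign of each column is arbitrary, so a different platform may return a column multiplied by -1 without changing the decomposition.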

With the previous outputs we create a data frame for the regression:

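#Format loadings
Vectors <- data.frame(myPCA$vectors)
#Note: each column is one eigenvector (component) and each row is one respondent;
#the column names below simply reuse the labels from mm
names(Vectors) <- colnames(mm)
#Prepare for regression
#Create data (assumes df1 and mm list the respondents in the same order)
mydf <- cbind(df1[, "score", drop = FALSE], Vectors)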

Output:

  score   person_1   person_2   person_3
1     2 -0.5360029  0.8195308 -0.2026578
2    -2 -0.5831254 -0.5329938 -0.6130925
3     5 -0.6104635 -0.2104444  0.7635754
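
The cbind() step relies on the scores and the matrix listing respondents in the same order. As a sketch of an alternative for the "merge the score column back" part of the question, the loadings can be keyed by respondent and joined with merge(). The person column and the PC1..PC3 labels are assumptions introduced here for illustration.

```r
# Matrix from the question
mm <- matrix(c(0, 1.5, 1.8,
               1.5, 0, 2.2,
               1.8, 2.2, 0), nrow = 3,
             dimnames = list(paste0("person_", 1:3), paste0("person_", 1:3)))

# Scores keyed by respondent, deliberately listed out of order
scores_df <- data.frame(person = c("person_3", "person_1", "person_2"),
                        score  = c(5, 2, -2))

# Loadings keyed by the row names of mm
pcs <- eigen(mm)$vectors
colnames(pcs) <- paste0("PC", seq_len(ncol(pcs)))
loadings_df <- data.frame(person = rownames(mm), pcs)

# Join on the respondent key, so row order no longer matters
merged <- merge(scores_df, loadings_df, by = "person")
merged
```

The merged data frame can then be passed to lm() in the same way as mydf.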

Finally, the code for the regressions would be this:

#Build models
lm(score~person_1,data=mydf)
lm(score~person_2,data=mydf)
lm(score~person_3,data=mydf)

These models can be saved in new objects if you want. An example would be:

m1 <- lm(score~person_1,data=mydf)
summary(m1)

Output:

Call:
lm(formula = score ~ person_1, data = mydf)

Residuals:
     1      2      3 
 1.411 -3.842  2.431 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   -13.66      51.60  -0.265    0.835
person_1      -26.58      89.37  -0.297    0.816

Residual standard error: 4.76 on 1 degrees of freedom
Multiple R-squared:  0.08127,   Adjusted R-squared:  -0.8375 
F-statistic: 0.08846 on 1 and 1 DF,  p-value: 0.816
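
If all three fits are needed, they can be kept together instead of in separate objects. The following is a minimal sketch, assuming the same mydf layout as above; the names predictors, models and fit_summary are made up for this example.

```r
# Rebuild the regression data from the question
mm <- matrix(c(0, 1.5, 1.8,
               1.5, 0, 2.2,
               1.8, 2.2, 0), nrow = 3,
             dimnames = list(paste0("person_", 1:3), paste0("person_", 1:3)))
Vectors <- data.frame(eigen(mm)$vectors)
names(Vectors) <- colnames(mm)
mydf <- cbind(data.frame(score = c(2, -2, 5)), Vectors)

# Fit one simple regression per column and keep the models in a named list
predictors <- setdiff(names(mydf), "score")
models <- lapply(predictors, function(p) {
  lm(reformulate(p, response = "score"), data = mydf)
})
names(models) <- predictors

# Collect the slope and R-squared of each model in one table
fit_summary <- data.frame(
  predictor = predictors,
  slope     = sapply(predictors, function(p) unname(coef(models[[p]])[p])),
  r_squared = sapply(models, function(m) summary(m)$r.squared)
)
fit_summary
```

With only three respondents each model has a single residual degree of freedom, so the fits mainly illustrate the mechanics.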
