
How to fit a multiple linear regression model on 1664 explanatory variables in R

I have one response variable, and I'm trying to find a way of fitting a multiple linear regression model using 1664 different explanatory variables. I'm quite new to R and was taught to do this by writing out each of the explanatory variables in the model formula. However, as I have 1664 variables, that would take too long. Is there a quicker way of doing this?

Thank you!

I think you want to select a valid model from the 1664 variables, i.e. a model that explains as much of the variability in the data as possible with as few explanatory variables as possible. There are several ways of doing this:

  • Using expert knowledge to select variables that are known to be relevant. This can be because other studies have found them to be relevant, or because of some underlying process that you know makes that variable relevant.
  • Using some kind of stepwise regression approach, which selects the relevant variables based on how well they explain the data. Do note that this method has some serious downsides. Have a look at stepAIC (in the MASS package) for a way of doing this using the Akaike Information Criterion.
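As a rough illustration of the stepwise approach, here is a minimal sketch using stepAIC from the MASS package. The data are simulated, and the names dat, x1 to x20 and y are made up for the example; 20 predictors stand in for the 1664 in the question so that it runs quickly.

```r
library(MASS)  # provides stepAIC()

set.seed(1)
n <- 200
p <- 20  # small stand-in for the 1664 variables in the question

# Simulated predictors with hypothetical names x1 ... x20
dat <- as.data.frame(matrix(rnorm(n * p), nrow = n))
names(dat) <- paste0("x", seq_len(p))

# Simulated response: by construction only x1 and x2 matter
dat$y <- 2 * dat$x1 - 1.5 * dat$x2 + rnorm(n)

# Start from the intercept-only model and let stepAIC add or drop terms
null_model  <- lm(y ~ 1, data = dat)
upper_scope <- reformulate(paste0("x", seq_len(p)))  # ~ x1 + x2 + ... + x20

selected <- stepAIC(null_model,
                    scope = list(lower = ~ 1, upper = upper_scope),
                    direction = "both", trace = FALSE)

# Names of the variables that survived the selection
selected_vars <- attr(terms(selected), "term.labels")
selected_vars
```

Depending on the random draw, a few pure-noise variables may be selected alongside x1 and x2, which is one of the downsides mentioned above.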

Correlating 1664 variables with the response will yield around 83 significant correlations purely by chance if you test at the 5% significance level (0.05 * 1664 ≈ 83). So, tread carefully with automatic variable selection. Cutting down the number of variables with expert knowledge or some decorrelation technique (e.g. principal component analysis) would help.
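The 0.05 * 1664 figure can be checked with a small simulation. This is only a sketch: the sample size of 100 and all object names are assumptions, and both the predictors and the response are pure noise, so every "significant" correlation is a false positive.

```r
set.seed(42)
n <- 100    # number of observations (an assumption for the simulation)
p <- 1664   # number of explanatory variables, as in the question

X <- matrix(rnorm(n * p), nrow = n)  # predictors: pure noise
y <- rnorm(n)                         # response: unrelated to every column of X

# p-value of the correlation test between y and each column of X
p_values <- apply(X, 2, function(x) cor.test(x, y)$p.value)

# Number of correlations flagged as significant at the 5% level;
# this should land somewhere near 0.05 * 1664, i.e. roughly 83
n_significant <- sum(p_values < 0.05)
n_significant
```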

For a code example, you first need to include an example of your own (data + code) that I can build on.

I'll answer the programming question, but note that a regression with that many variables could often use some sort of variable selection procedure (e.g. @PaulHiemstra's suggestions).

  1. You can construct a data.frame containing only the variables you want to use, then use the formula shortcut form <- y ~ . , where the dot stands for all variables in the data that are not already mentioned in the formula.
  2. You could instead construct the formula from a character vector of variable names. For instance: form <- as.formula( paste( "y ~", paste(myVars, collapse="+") ) ) . Note that it is collapse, not sep, that joins the elements of myVars into a single string.

Then run your regression: 然后运行回归:

lm( form, data=dat )
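To make the two approaches concrete, here is a small self-contained sketch on simulated data. The names dat, myVars, x1 to x5 and id are hypothetical, and five predictors stand in for the 1664; the same code works unchanged with a longer myVars.

```r
set.seed(123)
n <- 50
myVars <- paste0("x", 1:5)   # hypothetical names of the explanatory variables

# Simulated data: five predictors, a response, and an id column
dat <- as.data.frame(matrix(rnorm(n * length(myVars)), nrow = n))
names(dat) <- myVars
dat$y  <- 1 + dat$x1 + rnorm(n)
dat$id <- seq_len(n)         # a column we do not want as a predictor

# Approach 1: keep only the wanted columns, then use the dot shortcut
fit1 <- lm(y ~ ., data = dat[, c("y", myVars)])

# Approach 2: build the formula from the variable names
form <- as.formula(paste("y ~", paste(myVars, collapse = "+")))
fit2 <- lm(form, data = dat)

# reformulate() from the stats package builds the same formula directly
fit3 <- lm(reformulate(myVars, response = "y"), data = dat)

# All three fits give the same coefficients
same_12 <- isTRUE(all.equal(coef(fit1), coef(fit2)))
same_23 <- isTRUE(all.equal(coef(fit2), coef(fit3)))
c(same_12, same_23)
```

Keep in mind that lm can only estimate all 1664 coefficients if the data have more observations than variables; otherwise some coefficients come back as NA.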
