
variable lengths differ in R

I am getting the error above when trying to use the cv.lm function. Please see my code:

sample<-read.csv("UU2_1_lung_cancer.csv",header=TRUE,sep=",",na.strings="NA")
  sample1<-sample[2:2000,3:131]
  samplex<-sample[2:50,3:131]
  y<-as.numeric(sample1[1,]) 
  y<-as.numeric(sample1[2:50,2]) 
  x1<-as.numeric(sample1[2:50,3])
  x2<-as.numeric(sample1[2:50,4])
  x11<-x1[!is.na(y)]
  x12<-x2[!is.na(y)]
  y<-y[!is.na(y)]
  fit1 <- lm(y ~ x11 + x12, data=sample)
  fit1
  x3<-as.numeric(sample1[2:50,5])
  x4<-as.numeric(sample1[2:50,6])
  x13<-x3[!is.na(y)]
  x14<-x4[!is.na(y)]
  fit2 <- lm(y ~ x11 + x12 + x13 + x14, data=sample)
  anova(fit1,fit2)
  install.packages("DAAG")
  library("DAAG")
  cv.lm(df=samplex, fit1, m=10) # 10-fold cross-validation (m=10)

Any insight will be appreciated.

Example of data:
ID       peak height  LCA001  LCA002  LCA003
N001786  32391.111     0.397   0.229  -0.281
N005356  32341.473     0.397  -0.655  -1.301
N002416  32215.474    -0.703  -0.214  -0.901
GS239    31949.777     0.354   0.118   0.272
N016343  31698.853     0.226   0.04   -0.006
N003255  31604.978     0.024  NA      -0.534
N004358  31356.597    -0.252  -0.022  -0.407
N000122  31168.09     -0.487  -0.533  -0.134
GS10564  31106.103    -0.156  -0.141  -1.17
GS17987  31043.876    NA       0.253   0.553
N003674  30876.207     0.109   0.093   0.07

Please see the example of the data above.

First, you are using lm(...) incorrectly, or at least in a very unconventional way. The point of the data=sample argument is that the variables in the formula refer to columns of sample. It is generally bad practice to use free-standing vectors in the formula when data= is given: lm(...) looks each name up in data first and falls back to the workspace only when there is no such column, so vectors you have filtered separately can end up with a different length from the columns, and that mismatch is what "variable lengths differ" reports.
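To make the point above concrete, here is a minimal sketch of how the error can arise. The data frame `toy`, the vector `x_free`, and all the values are made up for illustration; they are not taken from the question's data.

```r
# Hypothetical toy data; one response value is missing.
toy <- data.frame(
  y = c(1.1, 1.9, NA, 4.2, 4.8),
  x = c(2, 4, 6, 8, 10)
)

# A free-standing vector filtered outside the data frame: length 4, not 5.
x_free <- toy$x[!is.na(toy$y)]

# y is found in `toy` (5 values) but x_free comes from the workspace
# (4 values), so building the model frame fails. The error message is
# captured as a string instead of stopping the script.
err <- tryCatch(
  lm(y ~ x_free, data = toy),
  error = function(e) conditionMessage(e)
)
print(err)

# Referring only to columns of `toy` lets lm() drop the incomplete row itself.
fit <- lm(y ~ x, data = toy)
print(nobs(fit))  # number of rows actually used in the fit
```

In this sketch the second call needs no manual filtering at all, which is the main reason to keep every variable inside the data frame.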

So try this:

## not tested...
sample <- read.csv(...)
colnames(sample)[2:6] <- c("y","x1","x2","x3","x4")
fit1 <- lm(y~x1+x2, data=sample[2:50,],na.action=na.omit)
library(DAAG)
cv.lm(df=na.omit(sample[2:50,]),fit1,m=10)

This gives columns 2:6 the appropriate names and then uses those names in the formula. The argument na.action=na.omit tells lm(...) to exclude every row that has an NA value in any of the relevant columns. This is actually the default, so it is not needed here, but it is included for clarity.
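The NA handling described above can be checked on a small example. This is only a sketch: `toy_na`, its values, and the `unrelated` column are hypothetical, and it assumes the session's `na.action` option has not been changed from its usual default.

```r
# Hypothetical data with NAs in the response, in a predictor,
# and in a column the model never uses.
toy_na <- data.frame(
  y         = c(3.1, 2.9, NA, 4.0, 5.2, 6.1, 5.5, 7.0),
  x1        = c(1, 2, 3, NA, 5, 6, 7, 8),
  x2        = c(0.5, 0.1, 0.4, 0.9, 0.3, 0.8, 0.2, 0.6),
  unrelated = c(NA, 1, 1, 1, 1, 1, 1, 1)
)

# With and without the explicit argument, lm() drops rows 3 and 4,
# because only y, x1 and x2 are relevant to the formula.
fit_default  <- lm(y ~ x1 + x2, data = toy_na)
fit_explicit <- lm(y ~ x1 + x2, data = toy_na, na.action = na.omit)
print(nobs(fit_default))                                    # 6
print(all.equal(coef(fit_default), coef(fit_explicit)))     # TRUE

# na.omit() on the whole data frame also drops row 1, because it looks
# at every column, including ones the model does not use.
rows_all_columns   <- nrow(na.omit(toy_na))                          # 5
rows_model_columns <- nrow(na.omit(toy_na[, c("y", "x1", "x2")]))    # 6
print(c(rows_all_columns, rows_model_columns))
```

One consequence worth checking on wide data: calling na.omit() on a data frame with many columns can remove far more rows than lm() itself would, so it may be safer to subset to the model's columns first.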

Finally, cv.lm(...) uses its second argument to find the formula definition, so in your code:

cv.lm(df=samplex, fit1, m=10)

is equivalent to:

cv.lm(df=samplex,y~x11+x12,m=10)

Since there are (presumably) no columns named x11 and x12 in samplex, and since you defined these vectors externally, cv.lm(...) throws the error you are getting.
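A quick way to spot this kind of mismatch before calling a cross-validation function is to compare the variables in the fitted formula with the column names of the data frame. The sketch below uses only base R; `toy_cv` and the helper `missing_vars` are hypothetical names introduced for illustration and are not part of DAAG.

```r
# Hypothetical data frame standing in for the one passed to cross-validation.
toy_cv <- data.frame(
  y  = c(2.0, 3.1, 3.9, 5.2, 6.1, 6.8),
  x1 = c(1, 2, 3, 4, 5, 6),
  x2 = c(0.3, 0.9, 0.1, 0.7, 0.4, 0.8)
)

# Free-standing copies, mirroring the pattern in the question.
y   <- toy_cv$y
x11 <- toy_cv$x1
x12 <- toy_cv$x2

fit_free <- lm(y ~ x11 + x12)                  # formula names workspace vectors
fit_cols <- lm(y ~ x1 + x2, data = toy_cv)     # formula names data frame columns

# Hypothetical helper: which formula variables are not columns of `df`?
missing_vars <- function(fit, df) {
  setdiff(all.vars(formula(fit)), names(df))
}

print(formula(fit_free))                # y ~ x11 + x12
print(missing_vars(fit_free, toy_cv))   # "x11" "x12"
print(missing_vars(fit_cols, toy_cv))   # character(0)
```

If the helper returns any names, the formula stored in the fit cannot be evaluated purely from the data frame, which is the situation described above.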
