随机森林模型中使用递归特征消除的特征选择错误

Question

I have some hundred samples and I have already classified them into four different classes (clusters).我有几百个样本，我已经将它们分为四个不同的类（集群）。 Now, I'm interested in identifying the best set of genes that classify the samples into different classes.现在，我感兴趣的是确定将样本分类为不同类别的最佳基因集。

I want to apply randomforest with recursive feature elimination and detect the genes (features).我想应用具有递归特征消除的随机森林并检测基因（特征）。 My data looks like below.我的数据如下所示。 Just posting some example data here.只是在这里发布一些示例数据。

在此处输入图片说明

Above data is just an example: My original data is in dataframe df with 100 samples in first column and 4 classes in second column and columns 3 to column 1002 there are in total 1000 genes with expression values.上面的数据只是一个例子：我的原始数据在数据框df中，第一列有 100 个样本，第二列有 4 个类，第 3 列到第 1002 列共有 1000 个具有表达值的基因。

I'm using the below code but I see there is an error.我正在使用下面的代码，但我看到有一个错误。

library(caret)
library(mlbench)
library(Hmisc)
library(randomForest)

# define the control using a random forest selection function
control <- rfeControl(functions=rfFuncs, method="cv", number=10)

# run the RFE algorithm
results <- rfe(df[,3:1002], df[,2], sizes = df[,1:1002], rfeControl=control)

There is an Error: And I feel that I'm doing wrong somewhere.有一个错误：我觉得我在某处做错了。

Error in summary.connection(connection) : invalid connection

Here I'm giving the dput of the above data.这里我给出了上述数据的dput 。

df <-structure(list(Samples = structure(c(1L, 8L, 9L, 10L, 11L, 12L, 
13L, 14L, 15L, 2L, 3L, 4L, 5L, 6L, 7L), .Label = c("Sample1", 
"Sample10", "Sample11", "Sample12", "Sample13", "Sample14", "Sample15", 
"Sample2", "Sample3", "Sample4", "Sample5", "Sample6", "Sample7", 
"Sample8", "Sample9"), class = "factor"), Class = structure(c(1L, 
2L, 3L, 1L, 2L, 4L, 2L, 1L, 1L, 4L, 1L, 3L, 4L, 1L, 1L), .Label = c("Class1", 
"Class2", "Class3", "Class4"), class = "factor"), Gene1 = c(1.030078784, 
0.944152632, 0.140700452, 0.013432323, 0.265233165, -0.084496727, 
4.835469554, 0.089434913, -0.433436179, 1.462895475, -0.116005356, 
1.007868422, 0.244881864, -1.495666899, 0.364368654), Gene2 = c(1.407236415, 
1.229003431, -0.322221459, -1.361955252, 0.310963955, 0.80115063, 
4.27765356, 0.872413223, -0.568249851, 1.187873069, -0.255284575, 
1.878058722, -0.767371822, -0.859697473, 0.057304769), Gene3 = c(0.200772234, 
-0.048349737, 1.224274924, 0.492396142, 0.500786902, -0.731802706, 
1.853246564, 1.611995455, 0.287088678, 0.509235514, 2.031735375, 
3.074950771, 2.069407179, 0.886158642, 1.736798303), Gene4 = c(1.23309207, 
1.321282889, 2.403301108, 0.748860637, 1.019200751, 1.393254607, 
2.667976275, 1.158136576, 1.89503732, 2.178257717, 0.747697632, 
2.834410716, 0.028594536, -0.411039831, 1.100167946), Gene5 = c(0.883005616, 
0.570786704, 0.72649548, 4.705893892, 0.086345885, 0.502530136, 
2.681497202, 0.640362079, 0.327319762, 2.086767741, 1.853085301, 
1.001799748, 0.126208601, 0.911621722, 0.671191951), Gene6 = c(2.590519025, 
3.076688902, 1.77414005, 1.014363629, 1.134652225, 2.71957962, 
4.696379063, -0.301828123, 1.214261665, 2.413881644, -0.470794827, 
0.520494891, 0.194511306, 0.075331863, 2.315680177), Gene7 = c(0.088929673, 
0.472549468, -0.125630236, -0.069648505, -0.715250242, 0.068554966, 
4.131662998, -0.075265565, -1.234425917, 0.343350342, 0.190414782, 
1.153495806, 0.210317581, -0.475603641, 0.294299351), Gene8 = c(2.112231178, 
2.780100532, 2.423828553, 1.569215682, 1.018119196, 2.583413401, 
6.483053565, 2.215201821, 1.893325529, 2.342058862, 4.001423142, 
4.221704757, 1.978211867, 1.452633851, 2.556589741)), class = "data.frame", row.names = c(NA, 
-15L))

Can anyone please tell me how I can use the above data and apply random forest to know which genes classify the samples into different classes.谁能告诉我如何使用上述数据并应用随机森林来了解哪些基因将样本分为不同的类别。 thanq.谢谢。

Answer 1

sizes refer to the number of features you would like to try and retain, it should numeric but you provided something weird in df[,1:1002] .尺寸是指您想要尝试和保留的功能数量，它应该是数字，但您在df[,1:1002]提供了一些奇怪的东西。

See something like below, where i simulate a dataset and setting the sizes correctly ensures it runs along to choose the optimal number of features (from what you provide):请参阅下面的内容，其中我模拟数据集并正确设置大小以确保它运行以选择最佳功能数量（从您提供的内容中）：

library(caret)
library(mlbench)
library(Hmisc)
library(randomForest) 

set.seed(101)
df = data.frame(samples=paste0("Samples",1:99),
                Class=paste0("Class",rep(1:3,33)),
                matrix(rnorm(99*1000),ncol=1000))

colnames(df)[3:ncol(df)]=paste0("Gene",1:1000)

# we create like 100 informative genes for Class1 and Class2
df[df$Class=="Class1",3:103] = df[df$Class=="Class1",3:103] + rpois(33*100,1.5)
df[df$Class=="Class2",104:203] = df[df$Class=="Class2",104:203] + rpois(33*100,1.5)

control <- rfeControl(functions=rfFuncs, method="cv", number=2)

# run the RFE algorithm
results <- rfe(df[,3:1002], df[,2], sizes = c(50,100,200), 
               rfeControl=control)

From the above, I ask for 50,100 or 200 informative features and I get:从上面，我要求 50,100 或 200 个信息特征，我得到：

results
Recursive feature selection

Outer resampling method: Cross-Validated (2 fold) 

Resampling performance over subset size:

 Variables Accuracy  Kappa AccuracySD KappaSD Selected
        50   0.9792 0.9688    0.02946 0.04419         
       100   0.9896 0.9844    0.01473 0.02210         
       200   1.0000 1.0000    0.00000 0.00000        *
      1000   1.0000 1.0000    0.00000 0.00000         

The top 5 variables (out of 200):
   Gene94, Gene198, Gene137, Gene136, Gene158

> results$optsize
[1] 200

随机森林模型中使用递归特征消除的特征选择错误

问题描述

1 个解决方案

解决方案1
2 已采纳 2020-02-21 09:55:32

随机森林模型中使用递归特征消除的特征选择错误

问题描述

1 个解决方案

解决方案1 2 已采纳 2020-02-21 09:55:32

解决方案1
2 已采纳 2020-02-21 09:55:32