Random Forest using R

Question

I'm working on building a predictive model for breast cancer data using R. After performing gcrma normalization, I generated the potential predictor variables. Now, when I run the RF algorithm, I encounter the following error:

rf_output=randomForest(x=pred.data, y=target, importance = TRUE, ntree = 25001, proximity=TRUE, sampsize=sampsizes)

Error:   Error in randomForest.default(x = pred.data, y = target, importance = TRUE, : Can not handle categorical predictors with more than 53 categories.

Code:

library(randomForest)
library(ROCR)
library(Hmisc)
library(genefilter)

setwd("E:/kavya's project_work/final")
datafile<-"trainset_gcrma.txt"
clindatafile<-read.csv("mod clinical_details.csv")

outfile="trainset_RFoutput.txt"
varimp_pdffile="trainset_varImps.pdf"
MDS_pdffile="trainset_MDS.pdf"
ROC_pdffile="trainset_ROC.pdf"
case_pred_outfile="trainset_CasePredictions.txt"
vote_dist_pdffile="trainset_vote_dist.pdf"

data_import=read.table(datafile, header = TRUE, na.strings = "NA", sep="\t")
clin_data_import=clindatafile
clincaldata_order=order(clin_data_import[,"GEO.asscession.number"])
clindata=clin_data_import[clincaldata_order,]
data_order=order(colnames(data_import)[4:length(colnames(data_import))])+3 #Order data without first three columns, then add 3 to get correct index in original file
rawdata=data_import[,c(1:3,data_order)] #grab first three columns, and then remaining columns in order determined above
header=colnames(rawdata)

X=rawdata[,4:length(header)]
ffun=filterfun(pOverA(p = 0.2, A = 100), cv(a = 0.7, b = 10))
filt=genefilter(2^X,ffun)
filt_Data=rawdata[filt,]



#Get potential predictor variables
predictor_data=t(filt_Data[,4:length(header)])
predictor_names=c(as.vector(filt_Data[,3])) #gene symbol
colnames(predictor_data)=predictor_names


target= clindata[,"relapse"]
target[target==0]="NoRelapse"
target[target==1]="Relapse"
target=as.factor(target)

tmp = as.vector(table(target))
num_classes = length(tmp)
min_size = tmp[order(tmp,decreasing=FALSE)[1]]
sampsizes = rep(min_size,num_classes)
rf_output=randomForest(x=pred.data, y=target, importance = TRUE, ntree = 25001, proximity=TRUE, sampsize=sampsizes)


error:"Error in randomForest.default(x = pred.data, y = target, importance = TRUE, : Can not handle categorical predictors with more than 53 categories."

As I'm new to machine learning, I'm unable to proceed. Any help would be appreciated. Thanks in advance.

Answer 1

It is hard to say without knowing the data. Run class or summary on all your predictor variables to ensure that they are not accidentally interpreted as characters or factors. If you really do have a categorical predictor with more than 53 levels, you will have to convert it to binary variables. Example:

mtcars$automatic <- mtcars$am == 0
mtcars$manual <- mtcars$am == 1
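
The checks suggested above can be sketched as follows. This is only an illustration on a small made-up data frame (`toy`, with hypothetical columns `gene_a`, `gene_b` and `site`), not the asker's actual data, whose contents are not shown. Note also that the script in the question builds `predictor_data` but passes `pred.data` to `randomForest`, so it may be worth confirming which object is really being used before inspecting it.

```r
# Hypothetical toy data standing in for the real predictor table.
# gene_b holds numbers that were read in as text and became a factor.
toy <- data.frame(
  gene_a = c(5.1, 6.3, 7.2, 4.8),
  gene_b = c("8.1", "7.9", "8.0", "8.4"),
  site   = c("A", "B", "C", "A"),
  stringsAsFactors = TRUE
)

# 1. Inspect the class of every column; expression values should be numeric.
col_classes <- sapply(toy, class)
print(col_classes)

# 2. Count the levels of each factor column. randomForest refuses
#    categorical predictors with more than 53 categories.
n_levels <- sapply(toy[sapply(toy, is.factor)], nlevels)
print(n_levels)

# 3. A column that should have been numeric can be converted back.
#    Go through as.character() first, otherwise you get the level codes.
toy$gene_b <- as.numeric(as.character(toy$gene_b))

# 4. A genuinely categorical column can be expanded into 0/1 indicator
#    columns with model.matrix(); "- 1" drops the intercept so that
#    every level gets its own column.
dummies <- model.matrix(~ site - 1, data = toy)
print(dummies)
```

On real data, the same `sapply(..., class)` call on the predictor object passed to `randomForest` should reveal whether any column is a character or factor when it was expected to be numeric.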

Answered 2016-03-16 09:20:16
