
Variable selection in R using kNN

I have a data frame (df) with 72 observations and 593 variables: 592 predictors plus one factor class variable, i.e. dim(df) is 72 593. I am looking for a way to select 7 variables (including the class variable), using Receiver Operating Characteristic (ROC) analysis to choose the optimum k value. I want to use these seven variables in an analysis with graphical models, but I don't want to pick them at random. I want my selection to be statistically justified.

What I would like to see as my result is something like:

Variables V23, V120, V230, V333, V496, V585, V593 were selected based on the highest value of ROC.

That is, I want to perform classification and select the "best" predictor variables, those with high accuracy, so that I can use these variables for graphical modelling.

I have tried using the caret package, but I don't know how to use it to select the variables (columns) with high accuracy so that they can be used in other analyses.

Thanks, guys. I am sure someone understood me.

Thanks.

kutex.

I would do something like this:

library(pROC)

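#' Select the top N variables by univariate ROC analysis
#' @param response the name of the class variable (a two-class factor)
#' @param predictors the names of the variables to select from
#' @param data a data frame containing the response and the predictors as columns
#' @param n the number of variables to select
#' @return the names of the n predictors with the highest AUC
select.top.N.ROC <- function(response, predictors, data, n) {
    n <- min(n, length(predictors))
    # AUC of each predictor used on its own as a classifier score
    aucs <- sapply(predictors, function(predictor) {
        as.numeric(auc(data[[response]], data[[predictor]]))
    })
    return(predictors[order(aucs, decreasing=TRUE)][seq_len(n)])
}

# Adjust the response name and the predictor names to match your data frame;
# every name passed as a predictor must be an existing numeric column.
top.variables <- select.top.N.ROC("class", paste("V", 1:593, sep=""), myDataFrame, 7)
cat(paste("Variables", paste(top.variables, collapse=", "), "were selected based on the highest value of ROC. "))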

As with any univariate feature selection method, this may select 7 perfectly correlated variables, none of which adds information beyond the first; in that case, selecting V23 alone would have been sufficient. For multivariate datasets, you should consider using a multivariate feature selection method instead.
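One simple way to reduce the redundancy problem described above is to keep ranking by AUC but skip any candidate that is strongly correlated with a variable already selected. The sketch below is only an illustration of that idea, not a full multivariate method. The helper names auc.rank and select.top.N.uncorrelated, and the max.cor threshold of 0.9, are hypothetical choices made for this example. To stay dependency-free it computes the AUC in base R from the rank-sum (Mann-Whitney) formula and folds it to be at least 0.5, which is similar in spirit to, but not identical to, pROC's automatic direction detection.

```r
# Rank-based AUC for a two-class response (Mann-Whitney form).
# The result is folded so that it is always >= 0.5 (assumption of this sketch).
auc.rank <- function(labels, x) {
    labels <- as.factor(labels)
    pos <- labels == levels(labels)[2]   # second level is treated as "positive"
    n1 <- sum(pos)
    n0 <- sum(!pos)
    r <- rank(x)                         # midranks are used for ties
    a <- (sum(r[pos]) - n1 * (n1 + 1) / 2) / (n1 * n0)
    max(a, 1 - a)
}

# Hypothetical helper: walk through the predictors in decreasing AUC order and
# keep a predictor only if its absolute Pearson correlation with every variable
# selected so far is at most max.cor. May return fewer than n names.
select.top.N.uncorrelated <- function(response, predictors, data, n, max.cor = 0.9) {
    aucs <- sapply(predictors, function(p) auc.rank(data[[response]], data[[p]]))
    ranked <- predictors[order(aucs, decreasing = TRUE)]
    selected <- character(0)
    for (p in ranked) {
        if (length(selected) >= n) break
        cors <- vapply(selected,
                       function(s) abs(cor(data[[p]], data[[s]])),
                       numeric(1))
        if (all(cors <= max.cor)) selected <- c(selected, p)
    }
    selected
}
```

It is called the same way as select.top.N.ROC, for example select.top.N.uncorrelated("class", predictor.names, myDataFrame, 6), where predictor.names stands for your own vector of column names. The correlation threshold is a tuning choice, and the filter only considers pairwise linear correlation, so it will not catch every form of redundancy.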
