
How can I make some datasets in R?

I'm studying imbalanced learning problems. 'Imbalanced' means that a data set exhibits an unequal distribution between its classes. For example, suppose you have a binary classification problem with 1000 examples: 900 examples are labeled class-0 and the other 100 are labeled class-1.

Most classification algorithms do not account for the underlying class distribution of the data, so they handle these 'imbalanced learning problems' poorly: if a classifier assigns every example to class-0, it still achieves 90% accuracy.
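To make the 90% figure concrete, here is a minimal sketch in base R. It assumes the 900/100 split from the example above and a trivial "classifier" that always predicts class-0; the variable names are only illustrative.

```r
# Labels matching the example: 900 class-0 and 100 class-1.
y <- c(rep(0, 900), rep(1, 100))

# A trivial "classifier" that always predicts the majority class.
pred <- rep(0, length(y))

accuracy <- mean(pred == y)                        # share of correct predictions
recall_1 <- sum(pred == 1 & y == 1) / sum(y == 1)  # share of class-1 examples found

accuracy  # 0.9
recall_1  # 0
```

Accuracy looks good while not a single class-1 example is detected, which is why class-specific measures such as recall are usually reported alongside accuracy for imbalanced data.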

One of the leading problems in class imbalance classification is the occurrence of class overlap in the datasets. Imbalance within a single class might also aggravate the problem. (Classification with class imbalance problem: A review, Aida Ali, Siti Mariyam Shamsuddin, and Anca L. Ralescu, ISSN 2074-8523)

So I want to simulate these problems: 1) comparing some methods when datasets have different degrees of overlap, and 2) comparing some methods when datasets have within-class imbalance.

[Figure: datasets with class overlap]

[Figure: datasets with within-class imbalance]

So I have to make the datasets in R, but I don't know how to generate them. So far I have only made some independent variables:

set.seed(3)
n     <- 1000               # sample size (assumed; matches the 1000-example case above)
x1    <- rnorm(n)           # normal dist
x3    <- rexp(n)            # exponential dist
x5    <- rpois(n,lambda=3)  # poisson dist
error <- rnorm(n)           # error term

Now I have to make a class variable Y that is related to these X's. I think I can adjust the degree of overlap through the coefficients of the X's.

IR  <- 90 # IR means imbalanced ratio  'IR=# of class0/# of class1'
eta <- -200*x1 + 0.5*sin(x3) + 300*x5^3 + error
Y   <- as.factor( ifelse( eta > quantile( eta, IR/(IR+1) ), 1, 0) )

But I don't actually know whether my code is correct. I also wonder how I can make a dataset with within-class imbalance. Could you help me? How can I make these datasets?
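One thing worth noting about the code above: the error term has standard deviation 1, while the other terms of eta are in the hundreds or larger, so the noise barely affects the labels. A possible way to tune the degree of overlap is to scale the noise relative to the signal rather than the coefficients alone. The sketch below assumes a simplified two-predictor signal; `make_labels`, `noise_sd`, and the chosen values are hypothetical, not part of the original code.

```r
set.seed(3)
n  <- 1000
IR <- 9                      # imbalance ratio: # of class 0 / # of class 1

x1 <- rnorm(n)
x2 <- rnorm(n)
signal <- 2 * x1 - x2        # part of eta explained by the predictors

# Hypothetical helper: label the top 1/(IR + 1) share of eta as class 1.
make_labels <- function(eta, IR) {
  as.integer(eta > quantile(eta, IR / (IR + 1)))
}

y_clean <- make_labels(signal, IR)   # labels without any noise

# Larger noise_sd means more overlap between the classes in predictor space.
noise_levels <- c(0.5, 2, 5)
results <- data.frame(noise_sd = noise_levels, n_class1 = NA, changed = NA)
for (i in seq_along(noise_levels)) {
  y <- make_labels(signal + rnorm(n, sd = noise_levels[i]), IR)
  results$n_class1[i] <- sum(y == 1)           # size of the minority class
  results$changed[i]  <- mean(y != y_clean)    # labels moved by the noise
}
print(results)
```

Because the threshold is a quantile of eta, the class sizes stay fixed at the chosen ratio while `noise_sd` changes only how cleanly the predictors separate the two classes.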

You could use the twoClassSim function from the caret package. Its intercept argument controls the class balance.

library(caret)
set.seed(123)
data <- twoClassSim(
  1000,
  intercept = -16.5,
  linearVars = 15,
  noiseVars = 5
)
table(data$Class)

Class1 Class2 
   899    101 
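For the within-class imbalance part of the question, one possible approach (a base R sketch, not something twoClassSim does for you) is to build the minority class from sub-clusters of unequal size. The helper `make_cluster`, the cluster centers, and the 90/10 split are all assumptions chosen for illustration.

```r
set.seed(42)

# Hypothetical helper: draw n points from a 2-D Gaussian cluster around `center`.
make_cluster <- function(n, center, sd = 0.5) {
  data.frame(x1 = rnorm(n, mean = center[1], sd = sd),
             x2 = rnorm(n, mean = center[2], sd = sd))
}

# Majority class: one large cluster of 900 points.
majority <- cbind(make_cluster(900, c(0, 0), sd = 1), Class = "Class1", Sub = "maj")

# Minority class: 100 points split unevenly across two sub-clusters (90 vs 10).
minor_a <- cbind(make_cluster(90, c(3, 3)),  Class = "Class2", Sub = "min_a")
minor_b <- cbind(make_cluster(10, c(-3, 2)), Class = "Class2", Sub = "min_b")

sim <- rbind(majority, minor_a, minor_b)
sim$Class <- factor(sim$Class)

table(sim$Class)           # between-class imbalance: 900 vs 100
table(sim$Class, sim$Sub)  # within-class imbalance: 90 vs 10 inside Class2
```

Moving the minority centers closer to c(0, 0), or increasing sd, increases the overlap with the majority class, so the same sketch can be reused to vary the degree of overlap.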
