I have a data frame (df) that looks like this:
  Value Country          ID
1    21      RU AAAU9001025
2    24      NG AAAU9001848
3    17      EG ACLU2799370
4     2      EG ACLU2799370
5    56      RU ACLU2799370
I want to run a one-class SVM classifier for outlier detection on Value, separately for each country, fitted on a relatively small sample, and flag each row as an outlier or not. The output should be the same data frame with an additional logical column indicating whether the row is an outlier:
  Value Country          ID   SVM
1    21      RU AAAU9001025 FALSE
2    24      NG AAAU9001848 FALSE
3    17      EG ACLU2799370 FALSE
4     2      EG ACLU2799370  TRUE
5    56      RU ACLU2799370  TRUE
6    25      EG AMFU3022141 FALSE
I am using the following code, but I can't get it to produce the desired data frame:
lapply(split(df, df$Country),
       function(x) {
         e1071::svm(x$Value[1:(ifelse(nrow(x) < 50000, nrow(x), 50000))],
                    nu = 0.98, type = "one-classification", kernel = "polynomial")
       })
Please help me figure this out, thanks!
First, simulate something like your data:
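# number of rows to simulate per country
NROWS = c(3000, 6000, 10000)
names(NROWS) = c("RU", "EG", "NG")
df = lapply(names(NROWS), function(i) {
  data.frame(
    # 90% standard normal values, 10% Poisson(5) values
    Value = c(rnorm(0.9 * NROWS[i]), rpois(0.1 * NROWS[i], 5)),
    Country = i,
    ID = paste0(i, "_", 1:NROWS[i])
  )
})
# stack the per-country data frames into one
df = do.call(rbind, df)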
Create a function to do the SVM, because you fit the model on a subset (at most limit values) but return a prediction for every row:
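library(e1071)
SVM_f = function(x, limit = 5000) {
  # train on at most `limit` randomly sampled values
  N = min(c(limit, length(x)))
  mdl = svm(x[sample(length(x), N)],
            nu = 0.98, type = "one-classification", kernel = "polynomial")
  # predict on all values; predict() returns TRUE for values the model
  # treats as in-class (inliers), FALSE otherwise
  predict(mdl, x)
}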
# apply per country, then stack the pieces back into one data frame
res = by(df, df$Country, function(x) {
  data.frame(x, SVM = SVM_f(x$Value))
})
res = do.call(rbind, res)
          Value Country   ID   SVM
RU.1  1.2802954      RU RU_1 FALSE
RU.2 -2.7119588      RU RU_2 FALSE
RU.3 -0.4856534      RU RU_3 FALSE
RU.4 -0.5041824      RU RU_4 FALSE
RU.5 -0.7043723      RU RU_5 FALSE
RU.6  0.0472744      RU RU_6 FALSE
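Two things to keep in mind with the approach above: predict() on a one-classification model returns TRUE for in-class values, so the column needs to be negated if TRUE should mean "outlier", and rbind-ing the by() result regroups the rows by country. Below is a minimal sketch of one way to handle both, not a tuned solution. The helper name flag_outliers, the toy data, and the defaults nu = 0.05 and kernel = "radial" are assumptions made for illustration (nu is roughly an upper bound on the fraction of training points left outside the class, so 0.98 would leave most rows outside).

```r
library(e1071)

# Hypothetical helper: TRUE means "flagged as an outlier".
# The nu and kernel defaults are illustrative assumptions, not tuned values.
flag_outliers <- function(x, limit = 5000, nu = 0.05, kernel = "radial") {
  n_train <- min(limit, length(x))
  # fit on a random subset of at most `limit` values
  train <- matrix(x[sample(length(x), n_train)], ncol = 1)
  mdl <- svm(train, type = "one-classification", nu = nu, kernel = kernel)
  # predict() gives TRUE for in-class values, so negate to get an outlier flag
  unname(!predict(mdl, matrix(x, ncol = 1)))
}

set.seed(1)
# small made-up data set, only for demonstration
toy <- data.frame(
  Value   = c(rnorm(200), rnorm(200, mean = 10), 50, -40),
  Country = c(rep("RU", 200), rep("EG", 200), "RU", "EG")
)

# split() / unsplit() puts each group's result back in the original row order
toy$SVM <- unsplit(
  lapply(split(toy$Value, toy$Country), flag_outliers),
  toy$Country
)

# count flagged and unflagged rows per country
table(toy$Country, toy$SVM)
```

Because unsplit() writes each group's result back to the positions it came from, the original row order and row names are unchanged.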
You can also use dplyr, but it might run a bit slower:
library(dplyr)
df %>% group_by(Country) %>% mutate(SVM=SVM_f(Value))