
R: Using cut on a data frame in R, with different breakpoints for each column

I am trying to convert every column of a data frame into a factor, where the cut-offs are the min, median, and max of each variable (column).

I have managed to do so by creating a data frame "cuts" in which the respective values are stored, and then using a for loop. However, I feel like it could be done more elegantly. Any ideas would be welcome!

A reproducible example follows:

# Sample data frame
mydf <- na.omit(airquality)[1:20,1:4]

# Break points
cuts<-rbind(sapply(mydf,min),sapply(mydf,median),sapply(mydf,max))

# Data frame to keep factors
mydf.bin <- mydf

for (i in seq_len(ncol(mydf))) {
  mydf.bin[,i] <- cut(mydf[,i], cuts[,i], include.lowest = TRUE)
}

mydf.bin

# I am looking for something like the following, except each column should have different break points
mybindf <- sapply(mydf, cut, c(0,50,350), include.lowest = TRUE)

Why not just use an anonymous function?

mybindf <- sapply(mydf, function(x) {
    cuts <- c(min(x), median(x), max(x))
    cut(x, cuts, include.lowest = TRUE)
})

Better practice in this case might be to define the function separately, which makes for easier debugging and more readable code:

cut_min_med_max <- function(x) {
    cuts <- c(min(x), median(x), max(x))
    cut(x, cuts, include.lowest = TRUE)
}
mybindf <- sapply(mydf, cut_min_med_max)

The only difference between these solutions and yours is that you generate the cut points separately from making the cuts, while here everything happens at once.
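One thing to watch with the sapply() versions: sapply() simplifies its result, so the factors come back as a character matrix rather than as factor columns. If you want a data frame of factors like your mydf.bin, a minimal sketch (reusing cut_min_med_max from above, and assuming the same sample data) is to use lapply() and assign into the existing data frame:

```r
# Sample data, as in the question
mydf <- na.omit(airquality)[1:20, 1:4]

# Helper from the answer above: bin a numeric vector at min, median, max
cut_min_med_max <- function(x) {
    cuts <- c(min(x), median(x), max(x))
    cut(x, cuts, include.lowest = TRUE)
}

# sapply() simplifies the list of factors to a character matrix
mat <- sapply(mydf, cut_min_med_max)

# lapply() returns a list of factors; assigning into mydf.bin[] keeps
# the data frame shape, the row names, and the column names
mydf.bin <- mydf
mydf.bin[] <- lapply(mydf, cut_min_med_max)

str(mydf.bin)
```

Whether the matrix or the data frame is more convenient depends on what you do with the result afterwards.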

And for completeness, your original code can be vectorized:

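mybindf <- as.data.frame(
    # as.data.frame(cuts) makes mapply() step through columns, not matrix cells;
    # SIMPLIFY = FALSE keeps each result a factor
    mapply(cut, mydf, as.data.frame(cuts),
           MoreArgs = list(include.lowest = TRUE), SIMPLIFY = FALSE)
)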

although you could just as easily have dropped both steps into the for loop.
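For what it's worth, putting both steps into the loop could look roughly like this. It is only a sketch on the same sample data; the loop variable col and the temporary names x and breaks are mine, not from the question:

```r
# Sample data, as in the question
mydf <- na.omit(airquality)[1:20, 1:4]

# Data frame to keep factors
mydf.bin <- mydf

for (col in names(mydf)) {
    x <- mydf[[col]]
    # Step 1: break points for this column only
    breaks <- c(min(x), median(x), max(x))
    # Step 2: cut the column at those break points
    mydf.bin[[col]] <- cut(x, breaks, include.lowest = TRUE)
}

mydf.bin
```

This drops the separate cuts object entirely, at the cost of not having the break points available for inspection afterwards.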

You can use the bin function from the OneR package:

library(OneR)

# Sample data frame
mydf <- na.omit(airquality)[1:20,1:4]

# bin function is an enhanced version of cut for data frames
mydf.bin <- bin(mydf, nbins = 2, method = "content")

mydf.bin
##        Ozone    Solar.R        Wind      Temp
## 1    (15,41]  (170,334] (7.39,11.5] (64.5,74]
## 2    (15,41] (7.67,170] (7.39,11.5] (64.5,74]
## 3  (0.96,15] (7.67,170] (11.5,20.1] (64.5,74]
## 4    (15,41]  (170,334] (7.39,11.5] (57,64.5]
## 7    (15,41]  (170,334] (7.39,11.5] (64.5,74]
## 8    (15,41] (7.67,170] (11.5,20.1] (57,64.5]
## 9  (0.96,15] (7.67,170] (11.5,20.1] (57,64.5]
## 12   (15,41]  (170,334] (7.39,11.5] (64.5,74]
## 13 (0.96,15]  (170,334] (7.39,11.5] (64.5,74]
## 14 (0.96,15]  (170,334] (7.39,11.5] (64.5,74]
## 15   (15,41] (7.67,170] (11.5,20.1] (57,64.5]
## 16 (0.96,15]  (170,334] (7.39,11.5] (57,64.5]
## 17   (15,41]  (170,334] (11.5,20.1] (64.5,74]
## 18 (0.96,15] (7.67,170] (11.5,20.1] (57,64.5]
## 19   (15,41]  (170,334] (7.39,11.5] (64.5,74]
## 20 (0.96,15] (7.67,170] (7.39,11.5] (57,64.5]
## 21 (0.96,15] (7.67,170] (7.39,11.5] (57,64.5]
## 22 (0.96,15]  (170,334] (11.5,20.1] (64.5,74]
## 23 (0.96,15] (7.67,170] (7.39,11.5] (57,64.5]
## 24   (15,41] (7.67,170] (11.5,20.1] (57,64.5]

NB: the outer limits are moved away by 0.1% of the range to ensure that the extreme values both fall within the break intervals, as is the standard behaviour of the cut function.
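To see why that adjustment matters, here is a small base R illustration. It only uses cut() and says nothing about how bin is implemented internally; the vector x and the names strict, closed, wide, and widened are made up for the example:

```r
x <- c(1, 12, 15, 18, 41)

# Explicit breaks at the observed extremes with the default
# include.lowest = FALSE: the minimum is not in (1, 15] and becomes NA
strict <- cut(x, breaks = c(min(x), median(x), max(x)))

# Option 1: close the lowest interval
closed <- cut(x, breaks = c(min(x), median(x), max(x)), include.lowest = TRUE)

# Option 2: move the outer limits away by 0.1% of the range
dx <- diff(range(x))
wide <- c(min(x) - dx / 1000, median(x), max(x) + dx / 1000)
widened <- cut(x, breaks = wide)

# cut() applies the same widening itself when breaks is a single number
auto <- cut(x, breaks = 2)

anyNA(strict)   # the minimum was dropped
anyNA(widened)  # every value has a bin
```

With a minimum of 1 and a range of 40, the lower limit moves to 0.96, which is where the (0.96,15] label for Ozone in the output above comes from.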

(Full disclosure: I am the author of this package)
