R: Using cut on a data frame in R, with different breakpoints for each column
I am trying to bin the columns of a data frame into factors, where the cut-offs are the min, median, and max of each variable (column).
I have managed to do so by creating a matrix "cuts" in which the respective values are stored, and then using a for loop. However, I feel like it could be done more elegantly. Any ideas would be welcome!
A reproducible example follows:
# Sample data frame
mydf <- na.omit(airquality)[1:20,1:4]
# Break points
cuts<-rbind(sapply(mydf,min),sapply(mydf,median),sapply(mydf,max))
# Data frame to keep factors
mydf.bin <- mydf
for (i in 1:ncol(mydf)) {
  mydf.bin[, i] <- cut(mydf[, i], cuts[, i], include.lowest = TRUE)
}
mydf.bin
# I am looking for something like the following, except each column should have different break points
mybindf <- sapply(mydf, cut, c(0, 50, 350), include.lowest = TRUE)
Why not just use an anonymous function?
mybindf <- sapply(mydf, function(x) {
  cuts <- c(min(x), median(x), max(x))
  cut(x, cuts, include.lowest = TRUE)
})
Better practice in this case might be to define the function separately, which makes for easier debugging and more readable code:
cut_min_med_max <- function(x) {
  cuts <- c(min(x), median(x), max(x))
  cut(x, cuts, include.lowest = TRUE)
}
mybindf <- sapply(mydf, cut_min_med_max)
The main difference between these solutions and your solution is that you generate the cut points separately from making the cuts, while here everything happens at once. Note also that sapply() simplifies its result to a matrix of interval labels rather than a data frame of factors.
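If the goal is a data frame of factors like the one the for loop builds, one option is to swap sapply() for lapply() and assign into the existing data frame. This is only a sketch reusing the helper defined above; the name m is made up for the comparison.

```r
# Sample data, as in the question
mydf <- na.omit(airquality)[1:20, 1:4]

# Cut one numeric vector at its own minimum, median and maximum
cut_min_med_max <- function(x) {
  cuts <- c(min(x), median(x), max(x))
  cut(x, cuts, include.lowest = TRUE)
}

# sapply() simplifies the four factors into a single matrix
m <- sapply(mydf, cut_min_med_max)

# lapply() returns a list of factors; assigning into mydf.bin[] keeps the
# data frame shape, row names and column names
mydf.bin <- mydf
mydf.bin[] <- lapply(mydf, cut_min_med_max)

str(mydf.bin)
```

The mydf.bin[] <- ... idiom replaces every column in place, so nothing has to be converted back with as.data.frame() afterwards.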
And for completeness, your original code can be vectorized:
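mybindf <- as.data.frame(
  # cuts is a matrix, so convert it to a data frame to pair column with column;
  # SIMPLIFY = FALSE keeps each result as a factor
  mapply(cut, mydf, as.data.frame(cuts),
         MoreArgs = list(include.lowest = TRUE), SIMPLIFY = FALSE)
)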
although you could just as easily have dropped both steps into the for loop.
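For what it's worth, the two-step idea (compute the break points first, then cut) can also be sketched with a list of break points and Map(). The name breaks is just illustrative, and using quantile() for min/median/max is my substitution, not something from the question.

```r
# Sample data, as in the question
mydf <- na.omit(airquality)[1:20, 1:4]

# One vector of break points per column: min (0%), median (50%), max (100%)
breaks <- lapply(mydf, quantile, probs = c(0, 0.5, 1), names = FALSE)

# Map() walks the columns and their break points in parallel and always
# returns a list, so each element stays a factor
mydf.bin <- as.data.frame(
  Map(function(x, b) cut(x, b, include.lowest = TRUE), mydf, breaks)
)

str(mydf.bin)
```

Keeping the break points in a list rather than a matrix avoids the question of whether the iteration runs over elements or over columns.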
You can use the bin function from the OneR package:
library(OneR)
# Sample data frame
mydf <- na.omit(airquality)[1:20,1:4]
# bin function is an enhanced version of cut for data frames
mydf.bin <- bin(mydf, nbins = 2, method = "content")
mydf.bin
## Ozone Solar.R Wind Temp
## 1 (15,41] (170,334] (7.39,11.5] (64.5,74]
## 2 (15,41] (7.67,170] (7.39,11.5] (64.5,74]
## 3 (0.96,15] (7.67,170] (11.5,20.1] (64.5,74]
## 4 (15,41] (170,334] (7.39,11.5] (57,64.5]
## 7 (15,41] (170,334] (7.39,11.5] (64.5,74]
## 8 (15,41] (7.67,170] (11.5,20.1] (57,64.5]
## 9 (0.96,15] (7.67,170] (11.5,20.1] (57,64.5]
## 12 (15,41] (170,334] (7.39,11.5] (64.5,74]
## 13 (0.96,15] (170,334] (7.39,11.5] (64.5,74]
## 14 (0.96,15] (170,334] (7.39,11.5] (64.5,74]
## 15 (15,41] (7.67,170] (11.5,20.1] (57,64.5]
## 16 (0.96,15] (170,334] (7.39,11.5] (57,64.5]
## 17 (15,41] (170,334] (11.5,20.1] (64.5,74]
## 18 (0.96,15] (7.67,170] (11.5,20.1] (57,64.5]
## 19 (15,41] (170,334] (7.39,11.5] (64.5,74]
## 20 (0.96,15] (7.67,170] (7.39,11.5] (57,64.5]
## 21 (0.96,15] (7.67,170] (7.39,11.5] (57,64.5]
## 22 (0.96,15] (170,334] (11.5,20.1] (64.5,74]
## 23 (0.96,15] (7.67,170] (7.39,11.5] (57,64.5]
## 24 (15,41] (7.67,170] (11.5,20.1] (57,64.5]
NB: the outer limits are moved away by 0.1% of the range to ensure that the extreme values both fall within the break intervals, as is the standard behaviour of the cut function when breaks is given as a single number.
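To see why the outer limits matter, here is a small base R sketch of how cut treats the extremes. The vector x is made up for illustration, and nothing here uses OneR.

```r
x <- c(1, 2, 3, 4)
b <- c(min(x), median(x), max(x))

# Explicit breaks at min, median, max: the intervals are (1,2.5] and (2.5,4],
# so the minimum itself falls outside the first interval and becomes NA
f_open <- cut(x, breaks = b)

# include.lowest = TRUE closes the first interval on the left: [1,2.5]
f_closed <- cut(x, breaks = b, include.lowest = TRUE)

# A single number asks for that many equal-width intervals; cut() then moves
# the outer limits outwards slightly so both extremes are covered
f_auto <- cut(x, breaks = 2)

print(f_open)
print(f_closed)
print(f_auto)
```

Note that breaks = 2 gives equal-width bins, which only happen to split at the median for this symmetric toy vector.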
(Full disclosure: I am the author of this package)