R: Using cut on a data frame in R, with different break points for each column

Question

I am trying to convert every column of a data frame to a factor, where the cut-offs are the min, median, and max of each variable (column).

I have managed to do so by creating a matrix "cuts" in which the respective values are stored, and then using a for loop. However, I feel like it could be done more elegantly. Any idea would be welcome!

A reproducible example follows:

# Sample data frame
mydf <- na.omit(airquality)[1:20,1:4]

# Break points
cuts <- rbind(sapply(mydf, min), sapply(mydf, median), sapply(mydf, max))

# Data frame to keep factors
mydf.bin <- mydf

for (i in seq_along(mydf)) {
  mydf.bin[, i] <- cut(mydf[, i], cuts[, i], include.lowest = TRUE)
}

mydf.bin

# I am looking for something like the following, except each column should have different break points
mybindf <- sapply(mydf, cut, c(0, 50, 350), include.lowest = TRUE)

Answer 1

Why not just use an anonymous function?

mybindf <- sapply(mydf, function(x) {
    cuts <- c(min(x), median(x), max(x))
    cut(x, cuts, include.lowest = TRUE)
})

Better practice in this case might be to define the function separately, which makes for easier debugging and more readable code:

cut_min_med_max <- function(x) {
    cuts <- c(min(x), median(x), max(x))
    cut(x, cuts, include.lowest = TRUE)
}
mybindf <- sapply(mydf, cut_min_med_max)

The only difference between these solutions and yours is that you compute the cut points in a separate step before making the cuts, while here both happen in a single function call.
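One caveat worth noting: sapply simplifies its result, so the factors returned by cut typically come back as a character matrix rather than as a data frame of factors. A minimal sketch that keeps the data frame structure, assuming that is what you are after, assigns the lapply result into a copy of the data frame (the name mydf.bin is borrowed from the question):

```r
# Sample data frame from the question
mydf <- na.omit(airquality)[1:20, 1:4]

# Cut one numeric vector at its min, median and max
cut_min_med_max <- function(x) {
    cuts <- c(min(x), median(x), max(x))
    cut(x, cuts, include.lowest = TRUE)
}

# Assigning into mydf.bin[] keeps the data frame and its row names;
# lapply() returns each column as a factor without simplifying
mydf.bin <- mydf
mydf.bin[] <- lapply(mydf, cut_min_med_max)

str(mydf.bin)
```

Each column of mydf.bin is then a two-level factor with its own break points.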

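And for completeness, your original code can be vectorized with mapply. Because cuts is a matrix, it has to be converted to a data frame first; otherwise mapply iterates over its individual elements rather than its columns. SIMPLIFY = FALSE keeps each column as a factor:

mybindf <- as.data.frame(
    mapply(cut, mydf, as.data.frame(cuts),
           MoreArgs = list(include.lowest = TRUE), SIMPLIFY = FALSE)
)

although you could just as easily have dropped both steps into the for loop.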
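If you later want more than two bins per column, the same idea generalizes, since the minimum, median and maximum are just the 0, 0.5 and 1 quantiles. The sketch below is an illustration rather than part of the approach above: cut_quantiles and its probs argument are hypothetical names, and it assumes the chosen quantiles are distinct within every column (cut stops with an error on duplicated breaks).

```r
mydf <- na.omit(airquality)[1:20, 1:4]

# Hypothetical helper: cut a numeric vector at the given quantiles
cut_quantiles <- function(x, probs = c(0, 0.5, 1)) {
    breaks <- quantile(x, probs = probs, names = FALSE)
    cut(x, breaks, include.lowest = TRUE)
}

# Default probs reproduce the min / median / max split
halves <- mydf
halves[] <- lapply(mydf, cut_quantiles)

# Extra arguments to lapply() are passed on, here giving quartiles
quartiles <- mydf
quartiles[] <- lapply(mydf, cut_quantiles, probs = seq(0, 1, by = 0.25))
```

Passing probs through lapply keeps the per-column break points while letting you change the number of bins in one place.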

Answer 2

You can use the bin function from the OneR package:

library(OneR)

# Sample data frame
mydf <- na.omit(airquality)[1:20,1:4]

# bin function is an enhanced version of cut for data frames
mydf.bin <- bin(mydf, nbins = 2, method = "content")

mydf.bin
##        Ozone    Solar.R        Wind      Temp
## 1    (15,41]  (170,334] (7.39,11.5] (64.5,74]
## 2    (15,41] (7.67,170] (7.39,11.5] (64.5,74]
## 3  (0.96,15] (7.67,170] (11.5,20.1] (64.5,74]
## 4    (15,41]  (170,334] (7.39,11.5] (57,64.5]
## 7    (15,41]  (170,334] (7.39,11.5] (64.5,74]
## 8    (15,41] (7.67,170] (11.5,20.1] (57,64.5]
## 9  (0.96,15] (7.67,170] (11.5,20.1] (57,64.5]
## 12   (15,41]  (170,334] (7.39,11.5] (64.5,74]
## 13 (0.96,15]  (170,334] (7.39,11.5] (64.5,74]
## 14 (0.96,15]  (170,334] (7.39,11.5] (64.5,74]
## 15   (15,41] (7.67,170] (11.5,20.1] (57,64.5]
## 16 (0.96,15]  (170,334] (7.39,11.5] (57,64.5]
## 17   (15,41]  (170,334] (11.5,20.1] (64.5,74]
## 18 (0.96,15] (7.67,170] (11.5,20.1] (57,64.5]
## 19   (15,41]  (170,334] (7.39,11.5] (64.5,74]
## 20 (0.96,15] (7.67,170] (7.39,11.5] (57,64.5]
## 21 (0.96,15] (7.67,170] (7.39,11.5] (57,64.5]
## 22 (0.96,15]  (170,334] (11.5,20.1] (64.5,74]
## 23 (0.96,15] (7.67,170] (7.39,11.5] (57,64.5]
## 24   (15,41] (7.67,170] (11.5,20.1] (57,64.5]

NB: the outer limits are moved away by 0.1% of the range so that both extreme values fall within the break intervals, which is the standard behaviour of the cut function.
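To see this behaviour in base R: when cut is given a number of intervals instead of explicit break points, it widens the outer limits slightly so that the minimum and maximum both land inside an interval. A small illustration using the Ozone column of the sample data (note that this produces two equal-width bins, not the equal-content bins that bin returns above):

```r
mydf <- na.omit(airquality)[1:20, 1:4]

# breaks = 2 asks cut() for two equal-width intervals over the range
ozone_bins <- cut(mydf$Ozone, breaks = 2)

# The lowest level starts just below min(mydf$Ozone), so no value is lost
levels(ozone_bins)
table(ozone_bins, useNA = "ifany")
```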

(Full disclosure: I am the author of this package)
