
Divide a range of values into bins of equal length: cut vs cut2

I'm using the cut function to split my data into equal bins. It does the job, but I'm not happy with the way it returns the values: what I need is the center of each bin, not its upper and lower ends.
I've also tried cut2{Hmisc}. This gives me the center of each bin, but it divides the range of the data into bins that contain the same number of observations rather than bins of the same length.

Does anyone have a solution to this?
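For context, a minimal sketch of the behaviour described in the question, using made-up values (the vector `v` is an assumption for illustration, not the asker's data): `cut` with a single number of breaks returns a factor whose levels are interval labels, not bin centers.

```r
# Made-up illustrative data, not the asker's
v <- c(0, 2, 3, 9, 10, 11, 18, 19, 20)

# Three equal-width intervals over range(v)
b <- cut(v, 3)

levels(b)  # interval labels giving the lower and upper end of each bin
table(b)   # number of observations falling into each interval
```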

It's not too hard to make the breaks and labels yourself, with something like this. Since the midpoint is a single number, I don't return a factor with labels here but a numeric vector.

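# Note: this name masks Hmisc::cut2 if Hmisc is attached.
cut2 <- function(x, breaks) {
  r <- range(x)
  # 2*breaks + 1 equally spaced points: bin edges alternate with bin midpoints
  b <- seq(r[1], r[2], length.out = 2 * breaks + 1)
  brk <- b[0:breaks * 2 + 1]  # odd positions: the bin edges
  mid <- b[1:breaks * 2]      # even positions: the bin midpoints
  # include.lowest = TRUE keeps min(x) in the first bin
  k <- cut(x, breaks = brk, labels = FALSE, include.lowest = TRUE)
  mid[k]
}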

There's probably a better way to get the bin breaks and midpoints; I didn't think about it very hard.
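One possible simplification, offered as a sketch rather than as the answerer's method: compute the `n + 1` equally spaced edges once, take each midpoint as the average of two adjacent edges, and let `cut` assign the bins. The name `bin_centers` and the sample vector are hypothetical.

```r
# Hypothetical helper: map each value to the center of its equal-width bin.
bin_centers <- function(x, n) {
  r <- range(x, na.rm = TRUE)
  edges <- seq(r[1], r[2], length.out = n + 1)      # n + 1 bin edges
  mids <- (head(edges, -1) + tail(edges, -1)) / 2   # n bin midpoints
  # include.lowest = TRUE puts min(x) into the first bin
  idx <- cut(x, breaks = edges, labels = FALSE, include.lowest = TRUE)
  mids[idx]
}

# Made-up data: edges are 0, 10, 20, so the centers are 5 and 15
res <- bin_centers(c(0, 1, 5, 10, 14, 20), 2)
res
# [1]  5  5  5  5 15 15
```

Bins are closed on the right, so the value 10 falls into the first bin. The sketch assumes `x` holds at least two distinct values; otherwise the edges are not unique and `cut` raises an error.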

Note that this answer differs from Joshua's: his gives the median of the data in each bin, while this gives the center of each bin.

> head(cut2(x,3))
[1] 16.666667  3.333333 16.666667  3.333333 16.666667 16.666667
> head(ave(x, cut(x,3), FUN=median))
[1] 18  2 18  2 18 18

Use ave like so:

set.seed(21)
x <- sample(0:20, 100, replace=TRUE)
xCenter <- ave(x, cut(x,3), FUN=median)
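To see what `ave` does in that call, here is a small sketch on made-up values (the vector `v` is an assumption chosen so the result is easy to check by hand): `ave` applies `FUN` within each group defined by `cut` and returns a vector of the same length as its input.

```r
# Made-up illustrative data
v <- c(0, 2, 3, 9, 10, 11, 18, 19, 20)

# Three equal-width intervals over range(v)
groups <- cut(v, 3)

# Each value is replaced by the median of the values in its interval
vCenter <- ave(v, groups, FUN = median)
vCenter
# [1]  2  2  2 10 10 10 19 19 19
```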

We can use smart_cut from the package cutr:

devtools::install_github("moodymudskipper/cutr")
library(cutr)

Using @Joshua's sample data:

Median by interval (same output as @Joshua's, except that it's an ordered factor):

smart_cut(x,3, "n_intervals", labels= ~ median(.))
# [1] 18 2  18 2  18 18 ...
# Levels: 2 < 11 < 18

Center of each interval (same output as @Aaron's, except that it's an ordered factor):

smart_cut(x,3, "n_intervals", labels= ~ mean(.y))
# [1] 16.67 3.333 16.67 3.333 16.67 16.67 ...
# Levels: 3.333 < 10 < 16.67

Mean of values by interval:

smart_cut(x,3, "n_intervals", labels= ~ mean(.))
# [1] 17.48 2.571 17.48 2.571 17.48 17.48 ...
# Levels: 2.571 < 11.06 < 17.48

labels can be a character vector, just like in base::cut.default, but it can also be, as it is here, a function of two parameters: the first is the values contained in the bin, and the second is the cut points of the bin.
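The idea of a two-parameter labelling function can be sketched in base R. This is not the cutr implementation; `label_bins`, its arguments, and the sample vector are hypothetical names and values used only for illustration.

```r
# Hypothetical base-R sketch (not cutr): label each equal-width bin with
# f(values_in_bin, cut_points_of_bin) and return one label per observation.
label_bins <- function(x, n, f) {
  r <- range(x)
  edges <- seq(r[1], r[2], length.out = n + 1)
  idx <- cut(x, breaks = edges, labels = FALSE, include.lowest = TRUE)
  labs <- vapply(seq_len(n), function(i) {
    f(x[idx == i], edges[i:(i + 1)])
  }, numeric(1))
  labs[idx]
}

# Made-up illustrative data
v <- c(0, 2, 3, 9, 10, 11, 18, 19, 20)

# Median of the values in each bin (uses the first parameter)
med <- label_bins(v, 3, function(values, cuts) median(values))

# Center of each bin (uses the second parameter)
ctr <- label_bins(v, 3, function(values, cuts) mean(cuts))
```

Unlike the smart_cut calls above, this sketch returns a plain numeric vector rather than an ordered factor.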

More on cutr and smart_cut: see the package repository, moodymudskipper/cutr on GitHub.
