Getting pretty cuts in Hmisc with cut2 (without the [ ) notation)

Question

I am trying to cut data neatly with the Hmisc package, as in the example below:

dummy <- data.frame(important_variable = 1:1000)
library(Hmisc)
dummy$cuts <- cut2(dummy$important_variable, g = 4)

The resulting cuts are correct for the values:

  important_variable       cuts
1                  1 [  1, 251)
2                  2 [  1, 251)
3                  3 [  1, 251)
4                  4 [  1, 251)
5                  5 [  1, 251)
6                  6 [  1, 251)
> table(dummy$cuts)
[  1, 251) [251, 501) [501, 751) [751,1000] 
       250        250        250        250

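However, I would like the data presented slightly differently. For example, instead of

[  1, 251)

[251, 501)

I would prefer this notation

1 - 250

251 - 500

Since I am doing this across many variables, I am interested in a reproducible solution that is easy to apply to multiple variables.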

Edit

Following the discussion in the comments, the solution must also handle messier variables such as x2 <- runif(100, 5.0, 7.5).

Answer 1

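We can use gsubfn to remove the brackets and, in the same step, change the numeric part by subtracting one from the second number. Note that the pattern also subtracts one from the closed upper bound of the last interval, so [751,1000] is shown as 751-999.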

 library(gsubfn)
 v1 <- dummy$cuts
 v1New <-  gsubfn('\\[\\s*(\\d+),\\s*(\\d+)[^0-9]+', ~paste0(x, '-', 
                     as.numeric(y)-1), as.character(v1))
 table(v1New)
 # 1-250 251-500 501-750 751-999 
 #  250     250     250     250

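For the second case, which involves decimals, we need to match numbers with a decimal part and capture them by placing the patterns in parentheses ( ([0-9.]+) , (\\d+\\.\\d+) ). We change the second capture group by converting it to numeric and subtracting 0.01 ( as.numeric(y)-0.01 ). \\s* means zero or more whitespace characters. The spacing in the labels is uneven, so we have to use it instead of \\s+ , which requires one or more whitespace characters.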

 v2New <- gsubfn('\\[\\s*([0-9.]+),(\\d+\\.\\d+).*', ~paste0(x,
                 '-',as.numeric(y)-0.01), as.character(v2))
 table(v2New)
 # v2New
 # 5.00-5.59 5.60-6.12 6.13-6.71 6.72-7.49 
 #        25        25        25        25

Data

 set.seed(24)
 x2 <- runif(100, 5.0, 7.5)
 v2 <- cut2(x2, g=4)
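
The difference between \\s* and \\s+ described above can be checked directly against two labels copied from the integer example. This is a minimal sketch in base R; the variable names are only for illustration and are not part of the original answer.

```r
# Two labels as printed by cut2 in the integer example:
# the first has padding after "[", the second has none.
labels <- c("[  1, 251)", "[251, 501)")

# \\s+ requires at least one whitespace character after "[".
one_or_more <- grepl("^\\[\\s+\\d+", labels)

# \\s* also accepts labels with no whitespace after "[".
zero_or_more <- grepl("^\\[\\s*\\d+", labels)

print(one_or_more)   # TRUE FALSE
print(zero_or_more)  # TRUE TRUE
```

Because only some labels are padded, a pattern built on \\s+ would silently leave the unpadded labels unchanged.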

Answer 2

This provides a general solution for both integer and decimal ranges (no need to specify the increment manually):

library(Hmisc)    # for cut2
library(stringr)

pretty_cuts <- function(cut_str) {

  # so we know when to not do something

  first_val <- as.numeric(str_extract_all(cut_str[1], "[[:digit:]\\.]+")[[1]][1])
  last_val <- as.numeric(str_extract_all(cut_str[length(cut_str)], "[[:digit:]\\.]+")[[1]][2])

  sapply(seq_along(cut_str), function(i) {

    # get cut range

    x <- str_extract_all(cut_str[i], "[[:digit:]\\.]+")[[1]]

    # see if a double vs an int & get # of places if decimal so
    # we know how much to inc/dec

    inc_dec <- 1
    if (str_detect(x[1], "\\.")) {
      x <- as.numeric(x)
      inc_dec <- 10^(-match(TRUE, round(x[1], 1:20) == x[1]))
    } else {
      x <- as.numeric(x)
    }

    # if not the edge cases inc & dec

    if (x[1] != first_val) { x[1] <- x[1] + inc_dec }
    if (x[2] != last_val)  { x[2] <- x[2] - inc_dec }

    sprintf("%s - %s", as.character(x[1]), as.character(x[2]))

  })

}

dummy <- data.frame(important_variable = 1:1000)
dummy$cuts <- cut2(dummy$important_variable, g = 4)
a <- pretty_cuts(dummy$cuts)

unique(dummy$cuts)
## [1] [  1, 251) [251, 501) [501, 751) [751,1000]
## Levels: [  1, 251) [251, 501) [501, 751) [751,1000]

unique(a)
## [1] "1 - 250"    "252 - 500"  "502 - 750"  "752 - 1000"

x2 <- runif(100, 5.0, 7.5)
b <- pretty_cuts(cut2(x2, g=4))

unique(cut2(x2, g=4))
## [1] [5.54,6.28) [6.28,6.97) [6.97,7.50] [5.02,5.54)
## Levels: [5.02,5.54) [5.54,6.28) [6.28,6.97) [6.97,7.50]

unique(b)
## [1] "5.54 - 6.27" "6.29 - 6.97" "6.98 - 7.49" "5.03 - 5.53"
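
In the printed integer output above, the lower bounds are shifted as well (252 rather than 251), so the value 251 falls in none of the printed ranges even though [251, 501) contains it. If the goal is for every value to fall in exactly one printed range, one option is to leave the lower bound alone and step back only those upper bounds that are followed by ")". The sketch below uses base R only; relabel_cuts and its step argument are hypothetical names, the step has to be supplied by hand (1 for integers, 0.01 for two decimals), and non-negative bounds are assumed.

```r
# Hypothetical helper (not part of Hmisc): turn "[a,b)" labels into "a - b".
# Assumes non-negative bounds and a caller-supplied step size.
relabel_cuts <- function(labels, step = 1) {
  labels <- as.character(labels)

  # Pull the two numbers out of each label.
  nums  <- regmatches(labels, gregexpr("[0-9.]+", labels))
  lower <- vapply(nums, function(n) as.numeric(n[1]), numeric(1))
  upper <- vapply(nums, function(n) as.numeric(n[2]), numeric(1))

  # A label ending in ")" excludes its upper bound, so step back once.
  # A label ending in "]" includes it and is left unchanged.
  open_right <- grepl("\\)$", labels)
  upper[open_right] <- upper[open_right] - step

  paste(lower, "-", upper)
}

pretty <- relabel_cuts(c("[  1, 251)", "[251, 501)", "[501, 751)", "[751,1000]"))
print(pretty)
# "1 - 250" "251 - 500" "501 - 750" "751 - 1000"
```

Since the function accepts a factor or character vector, it could be applied to several cut2 results in turn, for example with lapply over the relevant columns.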
