Calculating quantiles across multiple columns with dplyr

Question

I have data like this:

set.seed(123)

For a vector, if I want the median and the lower and upper bounds of a 95% interval, I can do this:

x <- rnorm(20)

quantile(x, probs = 0.500) # median
quantile(x, probs = 0.025) # lower bound
quantile(x, probs = 0.975) # upper bound

I have a data frame:

df <- data.frame(loc = rep(1:2, each = 4), 
                 year = rep(1980:1983, times = 2),
                 x1 = rnorm(8), x2 = rnorm(8), x3 = rnorm(8), x4 = rnorm(8), 
                 x5 = rnorm(8), x6 = rnorm(8), x7 = rnorm(8), x8 = rnorm(8))

For each location and year, I want to find the median, lower bound, and upper bound using x1 through x8.

df %>% group_by(loc, year) %>% 
dplyr::summarise(mean.x = quantile(x1, x2, x3, x4, x5, x6 , x7, x8, probs = 0.500),
                 lower.x = quantile(x1, x2, x3, x4, x5, x6 , x7, x8, probs = 0.025),
                 upper.x = quantile(x1, x2, x3, x4, x5, x6 , x7, x8, probs = 0.975))

But this gives me the same value for all three quantiles:

# A tibble: 8 x 5
# Groups:   loc [?]
    loc  year mean.x lower.x upper.x
  <int> <int>  <dbl>   <dbl>   <dbl>
1     1  1980 -1.07   -1.07   -1.07 
2     1  1981 -0.218  -0.218  -0.218
3     1  1982 -1.03   -1.03   -1.03 
4     1  1983 -0.729  -0.729  -0.729
5     2  1980 -0.625  -0.625  -0.625
6     2  1981 -1.69   -1.69   -1.69 
7     2  1982  0.838   0.838   0.838
8     2  1983  0.153   0.153   0.153

Also, instead of referring to the columns by name (x1, x2 ... x8), is there a way to refer to them by index, with something like

3:ncol(df)

Answer 1

You may want to convert the data from wide to long format first:

library(dplyr)
library(tidyr)
df %>% gather(xvar, value, x1:x8) %>% 
group_by(loc, year) %>% 
summarise(mean.x = quantile(value, probs = 0.50),
          lower.x = quantile(value, probs = 0.025),
          upper.x = quantile(value, probs = 0.975))

You get:

# A tibble: 8 x 5
# Groups:   loc [?]
    loc  year  mean.x lower.x upper.x
  <int> <int>   <dbl>   <dbl>   <dbl>
1     1  1980  0.152   -0.982   2.08 
2     1  1981 -0.478   -1.33    0.825
3     1  1982 -0.0415  -1.95    1.02 
4     1  1983  0.855   -0.180   1.43 
5     2  1980  0.658   -1.24    2.23 
6     2  1981  0.196   -0.782   0.827
7     2  1982 -0.629   -0.937   0.285
8     2  1983 -0.0737  -0.744   1.27
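The question also asked about selecting the columns by position rather than by name. The following is a minimal sketch of the same wide-to-long approach, assuming tidyr 1.0.0 or later (where pivot_longer() supersedes gather()) and dplyr 1.0.0 or later (for the .groups argument). The names value_cols, res, and median.x are illustrative and not from the original post, and the data frame is rebuilt in a compact form, so the numbers are not meant to match the output printed above.

```r
library(dplyr)
library(tidyr)

set.seed(123)
# Compact rebuild of a data frame shaped like the one in the question
# (assumption: 8 value columns named x1..x8).
df <- data.frame(loc = rep(1:2, each = 4),
                 year = rep(1980:1983, times = 2),
                 matrix(rnorm(64), nrow = 8,
                        dimnames = list(NULL, paste0("x", 1:8))))

# Select the value columns by position instead of by name.
value_cols <- 3:ncol(df)

res <- df %>%
  pivot_longer(cols = all_of(value_cols),
               names_to = "xvar", values_to = "value") %>%
  group_by(loc, year) %>%
  summarise(lower.x  = quantile(value, probs = 0.025, names = FALSE),
            median.x = quantile(value, probs = 0.5,   names = FALSE),
            upper.x  = quantile(value, probs = 0.975, names = FALSE),
            .groups = "drop")

print(res)
```

Because the selection is positional, it silently picks up any column added after the second one, so it is worth checking names(df)[value_cols] before relying on it.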

Answer 2

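The function quantile expects only a single data vector. When you write

quantile(x1, x2, x3, x4, x5, x6 , x7, x8, probs = 0.5)

you are passing it eight vectors. Only x1 is treated as data; x2 through x8 are matched positionally to quantile's other arguments and never contribute to the result. (In R 4.2.0 and later, the warning shown in the example below is an error instead.)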

Example:

x <- rnorm(20)
y <- rnorm(20) + 100

quantile(x, probs = 0.025) # lower bound
#   2.5% 
# -1.633378 
# y is not pooled with x: it is matched to na.rm, so the result is the
# same as quantile(x, probs = 0.025). The warning below reflects this.
quantile(x, y, probs = 0.025)
#    2.5% 
# -1.633378 
# Warning message:
#     In if (na.rm) x <- x[!is.na(x)] else if (anyNA(x)) stop("missing values and NaN's not allowed if 'na.rm' is FALSE") :
#     the condition has length > 1 and only the first element will be used
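For contrast, here is a small sketch of what happens when the two vectors are actually pooled with c() before calling quantile. The seed and the names q_x and q_xy are illustrative assumptions; no expected output is shown because the exact values depend on the random draws.

```r
set.seed(1)
x <- rnorm(20)        # values around 0
y <- rnorm(20) + 100  # values around 100

# Quantile of x alone: y plays no part.
q_x <- quantile(x, probs = 0.975)

# Quantile of the pooled 40 values: the upper tail now comes from y.
q_xy <- quantile(c(x, y), probs = 0.975)

print(q_x)
print(q_xy)
```

The pooled upper quantile lands near 100 while the quantile of x alone stays near the tail of a standard normal, which makes it easy to tell whether the second vector was used.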

To solve your specific problem, put x1 through x8 inside c() to form a single vector:

df %>% group_by(loc, year) %>% 
dplyr::summarise(lower.x = quantile(c(x1, x2, x3, x4, x5, x6 , x7, x8), probs = 0.025),
                 mean.x = quantile(c(x1, x2, x3, x4, x5, x6 , x7, x8), probs = 0.5),
                 upper.x = quantile(c(x1, x2, x3, x4, x5, x6 , x7, x8), probs = 0.975))

This yields:

# A tibble: 8 x 5
# Groups:   loc [?]
    loc  year     lower.x     mean.x   upper.x
  <int> <int>       <dbl>      <dbl>     <dbl>
1     1  1980 -1.12583212  0.1683845 1.1579655
2     1  1981 -1.20363611 -0.1399433 1.9308253
3     1  1982 -0.93238412 -0.3195850 0.3835611
4     1  1983 -2.08331501 -0.4235632 1.2267823
5     2  1980 -1.46528453 -0.3096375 0.9863813
6     2  1981 -1.51563211  0.1100798 0.8267675
7     2  1982 -1.16435350  0.1885864 0.8349510
8     2  1983 -0.01427533  0.4301591 1.9688637

By the way, the upper bound should be 0.975; you had a typo, 0.0975.
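As a rough base R sketch of the same idea, the value columns can be picked by index (the 3:ncol(df) from the question) and all three quantiles computed in one call by passing a vector to probs. The helper names value_cols, probs, groups, and res are illustrative, and the data frame is rebuilt compactly, so the numbers are not meant to reproduce the table above.

```r
set.seed(123)
# Compact rebuild of a data frame shaped like the one in the question
# (assumption: 8 value columns named x1..x8).
df <- data.frame(loc = rep(1:2, each = 4),
                 year = rep(1980:1983, times = 2),
                 matrix(rnorm(64), nrow = 8,
                        dimnames = list(NULL, paste0("x", 1:8))))

value_cols <- 3:ncol(df)        # value columns selected by position
probs <- c(0.025, 0.5, 0.975)   # lower bound, median, upper bound

# One data frame per (loc, year) combination.
groups <- split(df, list(df$loc, df$year), drop = TRUE)

res <- do.call(rbind, lapply(groups, function(g) {
  # unlist() pools every value column of the group into one vector,
  # which plays the same role as c(x1, ..., x8) above.
  q <- quantile(unlist(g[, value_cols]), probs = probs, names = FALSE)
  data.frame(loc = g$loc[1], year = g$year[1],
             lower.x = q[1], median.x = q[2], upper.x = q[3])
}))
res <- res[order(res$loc, res$year), ]
rownames(res) <- NULL

print(res)
```

Passing a vector to probs avoids repeating the pooled vector once per quantile, at the cost of unpacking the result by position.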
