
dplyr for rowwise quantiles

I have a data frame of strata, one per row, each of which has 1000 samples (one per column) from a posterior distribution of the estimates for that stratum.

mydf <- as.data.frame(lapply(seq(1, 1000), rnorm, n=100))
colnames(mydf) <- paste('s', seq(1, ncol(mydf)), sep='')

I want to add columns for a few quantiles of the distribution for each row. In base R, I'd write this:

quants <- t(apply(mydf, 1, quantile, probs=c(.025, .5, .975)))
colnames(quants) <- c('s_lo', 's_med', 's_hi')
mydf <- cbind(mydf, quants)

I suspect there's a direct way to do this in dplyr (maybe rowwise?) but my attempts have failed. Ideas?

dplyr is not optimized for row-based calculations like that. You can do this with rowwise(), but I recommend against it: performance will be abysmal. Your best speed will likely come from something that expects a matrix and can operate on its rows. I suggest apply.

Instead of dealing with a 100x1000 data.frame, for brevity I'll go with 5 columns:

set.seed(2)
mydf <- as.data.frame(lapply(seq(1, 5), rnorm, n=10))
colnames(mydf) <- paste('s', seq(1, ncol(mydf)), sep='')

Converting to a matrix is only reasonable if all columns are of the same class. In this case, they are all numeric so we are safe. (If you have non-numeric columns in the dataframe, extract only the ones you need here and bind them back in later.)

mymtx <- as.matrix(mydf)
apply(mymtx, 1, quantile, c(0.1, 0.9))
#         [,1]     [,2]     [,3]     [,4]     [,5]       [,6]     [,7]     [,8]     [,9]    [,10]
# 10% 1.028912 1.430939 1.999521 0.305907 1.753824 0.03267599 1.934381 1.270504 2.995816 1.489634
# 90% 4.950067 3.807735 4.881554 6.123989 4.886388 5.55628806 4.207605 4.184460 4.406384 3.782134

One thing to note about using apply like this is that the result has one column per input row, so it is transposed from what you might expect. Simply wrap it in t(...) to get one row per input row and one column per quantile.

The transposed result can be recombined with the original dataframe using cbind or a similar function.
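As a minimal sketch of the steps described above (select the numeric columns, apply over rows, transpose, bind back), assuming a hypothetical label column named stratum that is not in the original data:

```r
set.seed(2)
samples <- as.data.frame(lapply(1:5, rnorm, n = 10))
colnames(samples) <- paste0("s", 1:5)
# 'stratum' is a made-up non-numeric column, added only for illustration
df <- cbind(stratum = letters[1:10], samples)

# keep only the numeric columns for the matrix conversion
is_num <- vapply(df, is.numeric, logical(1))

# apply() returns one column per input row, so transpose with t()
quants <- t(apply(as.matrix(df[is_num]), 1, quantile,
                  probs = c(0.025, 0.5, 0.975)))
colnames(quants) <- c("s_lo", "s_med", "s_hi")

# bind the quantiles back onto the full data frame, label column included
result <- cbind(df, quants)
```

The non-numeric column never passes through as.matrix, so the numeric columns are not coerced to character.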

This can be done in a pipeline like so:

library(dplyr)
# `.` is mydf; the transposed quantiles are bound on as extra columns
mydf %>%
  bind_cols(as.data.frame(t(apply(., 1, quantile, c(0.1, 0.9)))))
#            s1         s2        s3       s4       s5        10%      90%
# 1   0.1030855  2.4176508 5.0908192 4.738939 4.616414 1.02891157 4.950067
# 2   1.1848492  2.9817528 1.8000742 4.318960 3.040897 1.43093918 3.807735
# 3   2.5878453  1.6073046 4.5896382 5.076164 4.158295 1.99952092 4.881554
# 4  -0.1303757  0.9603310 4.9546516 3.715842 6.903547 0.30590700 6.123989
# 5   0.9197482  3.7822290 3.0049378 3.223325 5.622494 1.75382406 4.886388
# 6   1.1324203 -0.3110691 0.5482936 3.404340 6.990920 0.03267599 5.556288
# 7   1.7079547  2.8786046 3.4772373 2.274020 4.694516 1.93438093 4.207605
# 8   0.7603020  2.0358067 2.4034418 3.097416 4.909156 1.27050387 4.184460
# 9   2.9844739  3.0128287 3.7922033 3.440938 4.815839 2.99581584 4.406384
# 10  0.8612130  2.4322652 3.2896367 3.753487 3.801232 1.48963385 3.782134

I'll leave the column naming up to you.
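For comparison, here is a rough sketch of the rowwise() route mentioned at the start, assuming dplyr 1.0 or later (where c_across() is available). The output names s_lo and s_hi are just illustrative. It is fine for small data but is expected to be much slower than apply on a wide data frame.

```r
library(dplyr)

set.seed(2)
mydf <- as.data.frame(lapply(seq(1, 5), rnorm, n = 10))
colnames(mydf) <- paste0("s", seq_len(ncol(mydf)))

# fix the set of sample columns up front so that the new quantile
# columns are not picked up by later expressions in the same mutate()
sample_cols <- colnames(mydf)

result <- mydf %>%
  rowwise() %>%
  mutate(
    s_lo = quantile(c_across(all_of(sample_cols)), 0.1, names = FALSE),
    s_hi = quantile(c_across(all_of(sample_cols)), 0.9, names = FALSE)
  ) %>%
  ungroup()
```

Selecting with all_of(sample_cols) rather than something like starts_with("s") matters here: s_lo also starts with "s" and would otherwise leak into the calculation of s_hi.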

With data.frame-like structures, it's going to be very hard to do rowwise operations efficiently, because the data is stored column by column. A more efficient solution is probably to reshape the data to long format, do the calculation per group on a single column, and then join the result back. With dplyr + tidyr, something like this:

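library(dplyr)
library(tidyr)

# add a row id so the quantiles can be joined back after reshaping
# (as_tibble replaces the deprecated as_data_frame)
mydf <- as_tibble(mydf) %>% 
    mutate(id = row_number())

quants <- mydf %>% 
    # long format: one row per (id, sample) pair; pivot_longer supersedes gather
    pivot_longer(-id, names_to = "sample", values_to = "value") %>% 
    group_by(id) %>% 
    summarize(q025 = quantile(value, 0.025),
              q500 = quantile(value, 0.5),
              q975 = quantile(value, 0.975))

result <- left_join(quants, mydf, by = "id")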

Or, if speed is particularly important, with data.table:

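library(data.table)
setDT(mydf)                               # convert to data.table by reference
mydf[, id := .I]                          # row id to group and join on
mydf_melt <- melt(mydf, id.vars = 'id')   # long format: id, variable, value
# one row per id, one column per quantile
quants <- mydf_melt[, as.list(quantile(value, c(0.025, 0.5, 0.975))), by = id]
setkey(quants, 'id')
setkey(mydf, 'id')
result <- quants[mydf]                    # keyed join back to the original columns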

purrr::pmap can be useful for such cases. It iterates in parallel over the elements of a list, which for a data.frame means operating rowwise. It is most convenient when each column corresponds to a named parameter of the function or when the function accepts dots; otherwise you have to collect each row into a vector with c.

library(tidyverse)
set.seed(47)

mydf <- as.data.frame(lapply(seq(1000), rnorm, n = 100))
names(mydf) <- paste0('s', seq_along(mydf))

# make a vector of each row; pass it to quantile; convert to a list; row-bind into a data frame
mydf %>% pmap_df(~as.list(quantile(c(...), c(.025, .5, .975)))) %>% 
    bind_cols(mydf)    # bind the original columns back on (by position, not a join)

#> # A tibble: 100 × 1,003
#>      `2.5%`    `50%`  `97.5%`          s1       s2        s3       s4
#>       <dbl>    <dbl>    <dbl>       <dbl>    <dbl>     <dbl>    <dbl>
#> 1  24.52876 501.2313 974.1547  2.99469634 1.857485 4.8062449 5.412425
#> 2  25.96306 501.5381 975.4427  1.71114251 1.534527 5.0045983 4.029735
#> 3  25.36792 499.8048 974.9472  1.18540528 1.575371 2.1515656 4.537178
#> 4  27.15081 500.9932 975.3688  0.71823499 2.747321 0.9841692 3.774623
#> 5  25.77212 498.7223 974.5576  1.10877555 2.659429 4.6865536 5.448446
#> 6  25.43256 501.2437 973.7319 -0.08573747 2.198829 3.7851258 5.769600
#> 7  24.29993 500.8599 975.5050  0.01451784 1.938954 4.1822894 5.205473
#> 8  25.16637 501.8597 974.8636  1.01513086 3.492032 3.2551467 2.570020
#> 9  25.36332 500.3975 973.3588  0.74795410 3.660735 3.3051286 4.270915
#> 10 27.02456 499.8759 974.3890 -0.46575030 2.771156 3.4292355 3.372155
#> # ... with 90 more rows, and 996 more variables: s5 <dbl>, s6 <dbl>,
#> #   s7 <dbl>, s8 <dbl>, s9 <dbl>, s10 <dbl>, s11 <dbl>, s12 <dbl>,
#> #   s13 <dbl>, s14 <dbl>, ...

The names generated by quantile are not syntactic, but could easily be replaced by inserting set_names(c('s_lo', 's_med', 's_hi')) before bind_cols. There are many other ways to reassemble the results, as well, if you like.
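A small sketch of that renaming step, shrunk to 5 columns and 10 rows for brevity. It uses pmap_dfr, which is the same function as pmap_df, and assumes purrr and dplyr are installed:

```r
library(purrr)
library(dplyr)

set.seed(47)
mydf <- as.data.frame(lapply(seq(5), rnorm, n = 10))
names(mydf) <- paste0("s", seq_along(mydf))

result <- mydf %>%
  # one list of three quantiles per row, row-bound into a tibble
  pmap_dfr(~ as.list(quantile(c(...), c(.025, .5, .975)))) %>%
  # replace the non-syntactic names `2.5%`, `50%`, `97.5%`
  set_names(c("s_lo", "s_med", "s_hi")) %>%
  # bind the original columns back on by position
  bind_cols(mydf)
```

The renaming has to happen before bind_cols, while the tibble still holds only the three quantile columns.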

