
Estimating the standard deviation from mean and confidence intervals with a gamma distribution in R

I have the following problem I'd like to solve in R and apply to a larger workflow. I need to estimate the standard deviation from a gamma distribution where the mean and 95% confidence intervals are known.

state = c("group1", "group2", "group3")
mean = c(0.970, 0.694, 0.988)
lowers = c(0.527, 0.381, 0.536)
uppers = c(1.87, 1.37, 1.90)

df = data.frame(state = state, mean = mean, lower = lowers, upper = uppers)

Using Excel and its "Solver" tool, I can adjust the standard deviation to minimize the sum of squared differences between the distribution's 2.5th and 97.5th percentiles and the target values (lowers and uppers). The challenge is that this needs to be scaled up to a rather large set of data and operationalized in my R data-frame workflow. Any ideas how to solve this?

I think this problem is ultimately an optimization problem, dealing with one row of data at a time. Since you want to scale it, though, here's an approximation for finding the distribution's core parameters.
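
For reference, here is a minimal sketch of what that direct, one-row-at-a-time optimization could look like in R (the analogue of the Excel Solver setup). The function name fit_one, its search interval, and the choice of stats::optimize() are my own assumptions, not part of the grid approach described below. Scale is fixed at mu / k so the mean is always matched exactly, and the shape k is chosen to minimize the squared quantile error.

fit_one <- function(mu, lwr, upr, interval = c(1e-3, 100)) {
  # squared distance between the candidate 2.5%/97.5% quantiles and the targets
  obj <- function(k) {
    (qgamma(0.025, shape = k, scale = mu / k) - lwr)^2 +
      (qgamma(0.975, shape = k, scale = mu / k) - upr)^2
  }
  k <- optimize(obj, interval = interval)$minimum
  c(shape = k, scale = mu / k, sd = sqrt(k) * (mu / k))  # gamma sd = sqrt(shape) * scale
}
fit_one(0.970, 0.527, 1.87)   # e.g. "group1" from the question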

The grid approach, by contrast, is not an optimization: it sweeps a defined range of possible k (shape) values and, given your mean, picks the shape/scale combination whose 2.5% and 97.5% quantiles most closely match your lower and upper bounds. You control the granularity of k, which is the closest you will get to a tolerance (the natural control in a true optimization).

As such, this process will be imperfect. I hope it gives you a fast-enough process with good-enough accuracy.

I'm going to first demonstrate a process that operates one row at a time, as guesser1. I'll then expand it to do the same operation on an arbitrary number of mean, lower, and upper values.

Data with Known Answers

But first, I want to generate my own samples so that we have known "truth" for this guesser.

library(dplyr)
library(purrr)   # for map2() and map_dbl()
set.seed(42)
n <- 4
randks <- tibble(
  k     = sample(1:10, size = n, replace = TRUE),
  scale = sample(seq(0.5, 2.5, by = 0.5), size = n, replace = TRUE)
) %>%
  mutate(
    samp     = map2(k, scale, ~ rgamma(1000, shape = .x, scale = .y)),
    theor_mu = k * scale,                            # theoretical gamma mean: shape * scale
    mu       = map_dbl(samp, ~ mean(.x)),            # sample mean
    lwr      = map_dbl(samp, ~ quantile(.x, 0.025)), # sample 2.5% quantile
    upr      = map_dbl(samp, ~ quantile(.x, 0.975))  # sample 97.5% quantile
  ) %>%
  select(-samp)
randks
# # A tibble: 4 x 6
#       k scale theor_mu    mu   lwr   upr
#   <int> <dbl>    <dbl> <dbl> <dbl> <dbl>
# 1    10   2       20   19.9   9.47 33.7 
# 2    10   1.5     15   15.1   7.36 25.0 
# 3     3   2        6    5.85  1.08 14.5 
# 4     9   0.5      4.5  4.51  1.99  7.72

Guesser1

Single row at a time:

guesser1 <- function(mu, lwr, upr, k.max = 10, k.by = 0.01) {
  stopifnot(length(mu) == 1, length(lwr) == 1, length(upr) == 1)
  # candidate shapes; scale is fixed at mu / k so the mean is always matched
  ks <- seq(0, k.max, by = k.by)[-1]
  L <- sapply(ks, function(k) qgamma(0.025, shape = k, scale = mu / k))
  U <- sapply(ks, function(k) qgamma(0.975, shape = k, scale = mu / k))
  # distance between candidate and target quantiles; keep the closest k
  dists <- sqrt((L - lwr)^2 + (U - upr)^2)
  ind <- which.min(dists)
  data.frame(
    k     = ks[ind],
    scale = mu / ks[ind],
    dist  = min(dists),
    lwr   = L[ind],
    upr   = U[ind]
  )
}

In action:

out1 <- do.call(rbind, Map(guesser1, randks$mu, randks$lwr, randks$upr))
cbind(subset(randks, select = -theor_mu), out1)
#    k scale    mu  lwr   upr     k scale  dist  lwr   upr
# 1 10   2.0 19.88 9.47 33.67 10.00 1.988 0.304 9.53 33.97
# 2 10   1.5 15.06 7.36 25.02 10.00 1.506 0.727 7.22 25.73
# 3  3   2.0  5.85 1.08 14.50  2.76 2.120 0.020 1.10 14.50
# 4  9   0.5  4.51 1.99  7.72  9.55 0.472 0.142 2.12  7.79
###  \____ randks __________/  \____ guessed ____________/

There are certainly some differences, underscoring my original assertion that this is imperfect. (Note that the mean is matched exactly by construction, since scale = mu / k; only the quantiles are approximated.)

Guessers

All rows at once. This is a little more work in the function, since it deals with matrices instead of just vectors. Not a problem, I just wanted to prove it one-at-a-time before going for the gusto.

guessers <- function(mu, lwr, upr, k.max = 10, k.by = 0.01, include.size = FALSE) {
  stopifnot(length(mu) == length(lwr), length(mu) == length(upr))
  ks <- seq(0, k.max, by = k.by)[-1]
  # 'ks'      dims: [ks]
  L <- outer(mu, ks, function(m, k) qgamma(0.025, shape = k, scale = m / k))
  U <- outer(mu, ks, function(m, k) qgamma(0.975, shape = k, scale = m / k))
  # 'L' & 'U' dims: [mu, ks]
  dists <- sqrt((L - lwr)^2 + (U - upr)^2)
  inds <- apply(dists, 1, which.min)   # best column (i.e. best k) per row
  mindists <- apply(dists, 1, min)
  i <- seq_along(mu)
  out <- data.frame(
    k     = ks[inds],
    scale = mu / ks[inds],
    dist  = mindists,
    lwr   = L[cbind(i, inds)],
    upr   = U[cbind(i, inds)]
  )
  if (include.size) {
    message("guessers memory: ",
            object.size(list(ks, L, U, dists, inds, mindists, i, out)))
  }
  out
}
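
The only piece above that may be unfamiliar is indexing a matrix with a two-column matrix of (row, column) pairs: L[cbind(i, inds)] pulls out one element per row, i.e. the best-k column for each input. A tiny base-R illustration:

m <- matrix(1:12, nrow = 3)
m[cbind(1:3, c(2, 1, 4))]   # one element per (row, column) pair
# [1]  4  2 12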

In action:

outs <- guessers(randks$mu, randks$lwr, randks$upr, include.size = TRUE)
# guessers memory: 106400
cbind(subset(randks, select = -theor_mu), outs)
#    k scale    mu  lwr   upr     k scale  dist  lwr   upr
# 1 10   2.0 19.88 9.47 33.67 10.00 1.988 0.304 9.53 33.97
# 2 10   1.5 15.06 7.36 25.02 10.00 1.506 0.727 7.22 25.73
# 3  3   2.0  5.85 1.08 14.50  2.76 2.120 0.020 1.10 14.50
# 4  9   0.5  4.51 1.99  7.72  9.55 0.472 0.142 2.12  7.79
###  \____ randks __________/  \____ guessed (same) _____/

(I included a memory message in there just to track how much this can scale. It's not bad now, and that argument should definitely not be used in production. FYI.)

Comparison

Using microbenchmark, we repeat each operation a few times and compare their run times.

microbenchmark::microbenchmark(
  g1 = Map(guesser1, randks$mu, randks$lwr, randks$upr),
  gs = guessers(randks$mu, randks$lwr, randks$upr)
)
# Unit: milliseconds
#  expr  min   lq mean median   uq   max neval
#    g1 27.3 28.8 33.9   29.7 33.0 131.1   100
#    gs 13.3 13.6 14.4   13.8 14.6  20.3   100

Not too surprisingly, the all-at-once guessers is a bit faster. When will this not be the case? When the number of rows gets so big that memory consumption is a problem. I don't know where that threshold is, though the sketch below shows one way to keep memory bounded.
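
If you do reach that point, one workaround (just a sketch; the function name and chunk size are arbitrary choices of mine) is to run guessers() on blocks of rows so the intermediate [rows x length(ks)] matrices stay bounded, then bind the pieces back together:

guessers_chunked <- function(mu, lwr, upr, chunk = 10000, ...) {
  # split the row indices into blocks of at most 'chunk' rows
  grp <- ceiling(seq_along(mu) / chunk)
  parts <- lapply(split(seq_along(mu), grp),
                  function(i) guessers(mu[i], lwr[i], upr[i], ...))
  do.call(rbind, parts)
}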

Let's try the same thing, but increasing randks from 4 rows to 1000 and repeating the benchmark.

n <- 1000
# randks <- ...
nrow(randks)
# [1] 1000
microbenchmark::microbenchmark(
  g1 = Map(guesser1, randks$mu, randks$lwr, randks$upr),
  gs = guessers(randks$mu, randks$lwr, randks$upr),
  times = 10
)
# Unit: seconds
#  expr  min   lq mean median   uq   max neval
#    g1 8.50 8.99 9.59   9.31 9.64 11.72    10
#    gs 3.35 3.44 3.61   3.63 3.77  3.93    10

So it's definitely faster. The median run-time for 1000 estimations is 3.63 seconds, so it handles roughly 275 rows per second (1000 / 3.63).

guessers memory: 24066176

(About 24 MB, which is essentially the three 1000-row by 1000-column double matrices L, U, and dists: 3 × 1000 × 1000 × 8 bytes.) That doesn't seem bad at all. Decrease k.by to increase your accuracy, or increase it to speed this up.
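
Finally, to get back to the quantity the question actually asks for: once you have shape and scale, the standard deviation of a gamma distribution is sqrt(shape) * scale. A sketch plugging guessers() into the (corrected) data frame from the question:

df <- data.frame(
  state = c("group1", "group2", "group3"),
  mean  = c(0.970, 0.694, 0.988),
  lower = c(0.527, 0.381, 0.536),
  upper = c(1.87, 1.37, 1.90)
)
est <- guessers(df$mean, df$lower, df$upper)
df$sd <- sqrt(est$k) * est$scale   # gamma sd = sqrt(shape) * scale
df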
