Keeping your statistician happy: Stata vs. R Student's t-test

Chapter 1: mean age by gender

I work a lot with epidemiologists and statisticians who have very specific requirements for their statistical output, and I frequently fail to reproduce exactly the same thing in R (our epidemiologist works in Stata).

Let's start with an easy example, a Student's t-test. What we are interested in is the difference in mean age at first diagnosis and a confidence interval.

1) create some sample data in R

set.seed(41)

cohort <- data.frame(
          id = seq(1,100),
          gender = sample(c(rep(1,33), rep(2,67)),100),
          age    = sample(seq(0,50),100, replace=TRUE)
          )

# save to import into Stata
# write.csv(cohort, "cohort.csv", row.names = FALSE)

2) import data and run t-test in Stata

import delimited "cohort.csv"
ttest age, by(gender)

What we want is the absolute difference in means (3.67 years) and the combined confidence interval (95% CI: 24.59 to 30.57).

3) run t-test in R

t.test(age~gender, data=cohort)

t.test(cohort$age[cohort$gender == 1])

t.test(cohort$age[cohort$gender == 2])

t.test(cohort$age)

There surely must be another way instead of running 4 t-tests in R!

You can put everything into one function with some tidyverse magic. The output can of course be adapted to your needs. broom's tidy() is used to get tidy output.

foo <- function(df, x, y){
  require(tidyverse)
  require(broom)
  # one-sample t-tests: one per group plus one for the whole sample ("ALL")
  a1 <- df %>% 
    select(ep=!!x, gr=!!y) %>% 
    mutate(gr=as.character(gr)) %>% 
    bind_rows(mutate(., gr="ALL")) %>% 
    split(.$gr) %>% 
    map(~tidy(t.test(.$ep))) %>% 
    bind_rows(.,.id = "gr") %>% 
    mutate_if(is.factor, as.character)
  # two-sample (Welch) t-test between the groups, appended as row "vs"
  tidy(t.test(as.formula(paste(x," ~ ",y)), data=df)) %>% 
    mutate_if(is.factor, as.character) %>% 
    mutate(gr="vs") %>% 
    select(gr, estimate, statistic, p.value,parameter, conf.low, conf.high, method, alternative) %>% 
    bind_rows(a1, .)}


foo(cohort, "age", "gender")
   gr  estimate statistic      p.value parameter  conf.low conf.high                  method alternative
1   1 25.121212  9.545737 6.982763e-11  32.00000  19.76068 30.481745       One Sample t-test   two.sided
2   2 28.791045 15.699854 5.700541e-24  66.00000  25.12966 32.452428       One Sample t-test   two.sided
3 ALL 27.580000 18.301678 1.543834e-33  99.00000  24.58985 30.570147       One Sample t-test   two.sided
4  vs -3.669833 -1.144108 2.568817e-01  63.37702 -10.07895  2.739284 Welch Two Sample t-test   two.sided

I recommend starting with something simpler, like this:

foo <- function(df){
 a1 <- broom::tidy(t.test(age~gender, data=df))
 a2 <- broom::tidy(t.test(df$age))
 a3 <- broom::tidy(t.test(df$age[df$gender == 1]))
 a4 <- broom::tidy(t.test(df$age[df$gender == 2]))
 list(rbind(a2, a3, a4), a1)
}

foo(cohort)
[[1]]
  estimate statistic      p.value parameter conf.low conf.high            method alternative
1 27.58000 18.301678 1.543834e-33        99 24.58985  30.57015 One Sample t-test   two.sided
2 25.12121  9.545737 6.982763e-11        32 19.76068  30.48174 One Sample t-test   two.sided
3 28.79104 15.699854 5.700541e-24        66 25.12966  32.45243 One Sample t-test   two.sided

[[2]]
   estimate estimate1 estimate2 statistic   p.value parameter  conf.low conf.high                  method alternative
1 -3.669833  25.12121  28.79104 -1.144108 0.2568817  63.37702 -10.07895  2.739284 Welch Two Sample t-test   two.sided
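One caveat about the two-sample row: R's t.test defaults to the Welch test (var.equal = FALSE), whereas, as far as I know, Stata's ttest with by() assumes equal variances unless the unequal or welch option is given. So to match Stata's interval for the difference you would probably want var.equal = TRUE. A minimal sketch that recreates the sample data from the question (the names pooled and welch are just for illustration):

```r
# Recreate the sample data from the question
set.seed(41)
cohort <- data.frame(
  id     = seq(1, 100),
  gender = sample(c(rep(1, 33), rep(2, 67)), 100),
  age    = sample(seq(0, 50), 100, replace = TRUE)
)

# Pooled-variance (classic Student) test: assumed to mirror Stata's default
pooled <- t.test(age ~ gender, data = cohort, var.equal = TRUE)

# Welch test: R's default
welch <- t.test(age ~ gender, data = cohort)

pooled$parameter   # degrees of freedom: n1 + n2 - 2
pooled$conf.int    # CI for the difference under equal variances
welch$conf.int     # CI for the difference under unequal variances
```

The group means and their difference are identical in both versions; only the standard error, the degrees of freedom and therefore the confidence interval and p-value change.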

You can make your own function:

tlimits <- function(data, group){
  # half-width of the 95% CI around the overall mean
  error <- qt(0.975, df = length(data)-1)*sd(data)/(sqrt(length(data)))
  overall <- mean(data)   # named 'overall' so it does not mask mean()
  means <- tapply(data, group, mean)
  c(abs(means[1] - means[2]), overall - error, overall + error)
}

tlimits(cohort$age, cohort$gender)
        1                     
 3.669833 24.589853 30.570147 
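As a sanity check, the hand-computed interval should agree with what a one-sample t.test reports for the same vector. A small sketch (the variable names are mine, and the age vector is simply regenerated here so the snippet runs on its own):

```r
# Any numeric vector works; this one just mimics the age column
set.seed(41)
age <- sample(seq(0, 50), 100, replace = TRUE)

n          <- length(age)
half_width <- qt(0.975, df = n - 1) * sd(age) / sqrt(n)

manual_ci  <- c(mean(age) - half_width, mean(age) + half_width)
builtin_ci <- as.numeric(t.test(age)$conf.int)

# TRUE when both intervals agree up to floating-point tolerance
all.equal(manual_ci, builtin_ci)
```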

"What we want is the absolute difference in means (3.67 years) and the combined confidence interval (95% CI: 24.59 to 30.57)."

Note that R's t.test performs a t-test, whereas you want a difference in means and a "combined confidence interval" (that is, the CI around the overall mean, ignoring the grouping variable). So you don't want a t-test but something else.

You can get the mean difference using, e.g.:

diff(with(cohort, tapply(age, gender, mean)))
# 3.669833 
# no point in using something more complicated e.g., t-test or lm

... and the CI using, e.g.:

confint(lm(age~1, data=cohort))
#                2.5 %   97.5 %
# (Intercept) 24.58985 30.57015

And obviously, you can easily combine the two steps into one function if you need it often.

doit <- function(a,b) c(diff= diff(tapply(a,b,mean)), CI=confint(lm(a~1)))
with(cohort, doit(age,gender))
#   diff.2       CI1       CI2 
# 3.669833 24.589853 30.570147 
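If the statistician also wants the per-group rows that Stata prints above the "combined" line, the same idea extends to a small summary table in base R. This is only a sketch: the function name summarise_groups and its column names are made up for illustration and are not from any package.

```r
# Hypothetical helper: one row per group plus a "combined" row,
# each with n, mean, standard error, standard deviation and a t-based CI
summarise_groups <- function(x, g, level = 0.95) {
  one_row <- function(v) {
    n    <- length(v)
    se   <- sd(v) / sqrt(n)
    half <- qt(1 - (1 - level) / 2, df = n - 1) * se
    data.frame(obs = n, mean = mean(v), std_err = se, std_dev = sd(v),
               conf_low = mean(v) - half, conf_high = mean(v) + half)
  }
  # split by group, then append the full vector as "combined"
  pieces <- c(split(x, g), list(combined = x))
  out    <- do.call(rbind, lapply(pieces, one_row))
  data.frame(group = names(pieces), out, row.names = NULL)
}

# Toy data, just to show the shape of the result
toy_x <- c(1, 2, 3, 4, 5, 6)
toy_g <- c(1, 1, 1, 2, 2, 2)
res   <- summarise_groups(toy_x, toy_g)
res
```

With the data from the question, the call would be with(cohort, summarise_groups(age, gender)).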
