I work a lot with epidemiologists and statisticians that have very specific requirements for their statistical output and I frequently fail to reproduce the exact same thing in R (our epidemiologst works in Stata).
Let's start with an easy example, a Student's t-test. What we are interested in is the difference in mean age at first diagnosis and a confidence interval.
1) create some sample data in R
set.seed(41)
cohort <- data.frame(
id = seq(1,100),
gender = sample(c(rep(1,33), rep(2,67)),100),
age = sample(seq(0,50),100, replace=TRUE)
)
# save to import into Stata
# write.csv(cohort, "cohort.csv", row.names = FALSE)
b) import data and run t-test in Stata
import delimited "cohort.csv"
ttest age, by(gender)
What we want is the absolute difference in the mean= 3.67 years and the combined confidence intervals = 95% CI: 24.59 - 30.57
b) run t-test in R
t.test(age~gender, data=cohort)
t.test(cohort$age[cohort$gender == 1])
t.test(cohort$age[cohort$gender == 2])
t.test(cohort$age)
There surely must be another way instead of running 4 t-tests in R!
You can try to put everything in one function and some tidyverse
magic. The output can be edited as your needs are of course. boom
's tidy
will be used for nice output.
foo <- function(df, x, y){
require(tidyverse)
require(broom)
a1 <- df %>%
select(ep=!!x, gr=!!y) %>%
mutate(gr=as.character(gr)) %>%
bind_rows(mutate(., gr="ALL")) %>%
split(.$gr) %>%
map(~tidy(t.test(.$ep))) %>%
bind_rows(.,.id = "gr") %>%
mutate_if(is.factor, as.character)
tidy(t.test(as.formula(paste(x," ~ ",y)), data=df)) %>%
mutate_if(is.factor, as.character) %>%
mutate(gr="vs") %>%
select(gr, estimate, statistic, p.value,parameter, conf.low, conf.high, method, alternative) %>%
bind_rows(a1, .)}
foo(cohort, "age", "gender")
gr estimate statistic p.value parameter conf.low conf.high method alternative
1 1 25.121212 9.545737 6.982763e-11 32.00000 19.76068 30.481745 One Sample t-test two.sided
2 2 28.791045 15.699854 5.700541e-24 66.00000 25.12966 32.452428 One Sample t-test two.sided
3 ALL 27.580000 18.301678 1.543834e-33 99.00000 24.58985 30.570147 One Sample t-test two.sided
4 vs -3.669833 -1.144108 2.568817e-01 63.37702 -10.07895 2.739284 Welch Two Sample t-test two.sided
I recommend to start from the beginning using this
foo <- function(df){
a1 <- broom::tidy(t.test(age~gender, data=df))
a2 <- broom::tidy(t.test(df$age))
a3 <- broom::tidy(t.test(df$age[df$gender == 1]))
a4 <- broom::tidy(t.test(df$age[df$gender == 2]))
list(rbind(a2, a3, a4), a1)
}
foo(cohort)
[[1]]
estimate statistic p.value parameter conf.low conf.high method alternative
1 27.58000 18.301678 1.543834e-33 99 24.58985 30.57015 One Sample t-test two.sided
2 25.12121 9.545737 6.982763e-11 32 19.76068 30.48174 One Sample t-test two.sided
3 28.79104 15.699854 5.700541e-24 66 25.12966 32.45243 One Sample t-test two.sided
[[2]]
estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high method alternative
1 -3.669833 25.12121 28.79104 -1.144108 0.2568817 63.37702 -10.07895 2.739284 Welch Two Sample t-test two.sided
You can make your own function:
tlimits <- function(data, group){
error <- qt(0.975, df = length(data)-1)*sd(data)/(sqrt(length(data)))
mean <- mean(data)
means <- tapply(data, group, mean)
c(abs(means[1] - means[2]), mean - error, mean + error)
}
tlimits(cohort$age, cohort$gender)
1
3.669833 24.589853 30.570147
What we want is the absolute difference in the mean= 3.67 years and the combined confidence intervals = 95% CI: 24.59 - 30.57
Notice that R's t.test
does a t-test, whereas you want a mean difference and "combined confidence intervals" (which is CI around the mean ignoring the grouping variable). So you don't want a t-test but something else.
You can get the mean difference using, eg:
diff(with(cohort, tapply(age, gender, mean)))
# 3.669833
# no point in using something more complicated e.g., t-test or lm
... and the CI using, eg:
confint(lm(age~1, data=cohort))
# 2.5 % 97.5 %
# (Intercept) 24.58985 30.57015
And obviously, you can easily combine the two steps into one function if you need it often.
doit <- function(a,b) c(diff= diff(tapply(a,b,mean)), CI=confint(lm(a~1)))
with(cohort, doit(age,gender))
# diff.2 CI1 CI2
# 3.669833 24.589853 30.570147
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.