
Compute mean across variable using R

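I am having trouble creating a dataset that contains the mean, median, and 25th and 75th percentiles across the levels of one variable (in my case the variable is crisis_t in the dataset df1). The code I tried is below. The problem is that the percentiles are not computed correctly, and I do not understand why. Any ideas?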

#what I have
country <- c("AT","AT","AT","AT","BE","BE","BE","BE","DE","DE","DE")
crisis_t  <- c(-1,0,1,2,-1,0,1,2,0,1,2)
value1  <- c(0.01,0.02,0.015,0.03,0.5,0.55,0.7,0.4,0.01,0.02,0.04)

df1 <- data.frame(country, crisis_t,value1)

#what I would like to obtain

crisis_t <- c(-1,0,1,2)
mean_t   <- c(0.255,0.193,0.245,0.156)
median_t <- c(0.255,0.02,0.02,0.04)
perc_25  <- c(NA,0.01,0.015,0.03)
perc_75  <- c(NA,0.55,0.7,0.4)

df2 <- data.frame(crisis_t, mean_t, median_t, perc_25, perc_75)

# my code does not compute the 25th percentile correctly
library(data.table)
df1 <- as.data.table(df1)
df2_try <- data.table()
df2_try <- df1[,mean_t2:=mean(value1, na.rm=TRUE),by=.(crisis_t)]
df2_try <- df1[,median_t2:=median(value1, na.rm=TRUE),by=.(crisis_t)]
df2_try <- df1[,perc_25:=quantile(value1, probs=0.25),by=.(crisis_t)]
df2_try <- df1[,perc_75:=quantile(value1, probs=0.75),by=.(crisis_t)]

df2_try

Thank you for your help.

Edit: the actual dataset.

country       <- c("AT","AT","AT","AT","BE","BE","BE","BE","BE","BE","BE","DE","DE","DE")
crisis_AT_1   <- c(-1,0,1,2,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA)
crisis_BE_1   <- c(NA,NA,NA,NA,-1,0,1,2,3,4,5,6,NA,NA)
crisis_BE_2   <- c(NA,NA,NA,NA,-4,-3,-2,-1,0,1,2,-2,NA,NA)
crisis_DE_1   <- c(NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,-1,0)
value1        <- c(0.01,0.02,0.015,0.03,0.5,0.55,0.7,0.4,0.01,0.02,0.04,0.02,0.14,0.21)

df3 <- data.frame(country, crisis_AT_1,crisis_BE_1,crisis_BE_2,crisis_DE_1,value1)

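By default, the quantile function uses a continuous definition of the quantile. This means that if no observation falls exactly at the quantile you requested, it estimates the value that should be there by interpolating over the empirical distribution.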
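To make the difference concrete, here is a small sketch using the two values that fall in the crisis_t = -1 group of the question's sample data. The variable names (x, default_q25, type2_q25) are only illustrative.

```r
# The two observations with crisis_t == -1 in the sample data
x <- c(0.01, 0.5)

# Default (type = 7): linear interpolation between the two observations
default_q25 <- quantile(x, 0.25, names = FALSE)

# type = 2: picks an observed value, averaging only at discontinuities
type2_q25 <- quantile(x, 0.25, type = 2, names = FALSE)

print(default_q25)  # 0.1325, a value that never occurs in the data
print(type2_q25)    # 0.01, the smaller of the two observations
```

The default result lies between the two observations, which is why the percentiles in the question look "wrong" when compared with the hand-computed expectation.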

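Judging from your expected output, you seem to want quantile type 2, which takes quantiles from the discrete empirical distribution but averages the two neighboring values at discontinuities. You can use it as follows: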

df1 <- as.data.table(df1)
df2_try <- copy(df1)
df2_try[,mean_t2:=  mean(value1),by=.(crisis_t)]
df2_try[,median_t2:=quantile(value1, 0.50, type=2),by=.(crisis_t)]
df2_try[,perc_25:=  quantile(value1, 0.25, type=2),by=.(crisis_t)]
df2_try[,perc_75:=  quantile(value1, 0.75, type=2),by=.(crisis_t)]

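However, this will not return NA as you wanted: the minimum sits at quantile 0 and the maximum at quantile 1, so the 25% and 75% quantiles do have values associated with them. Still, if you really need that behavior, you can force it with ifelse.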
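One possible way to force the NA values is sketched below. It assumes the rule behind the expected output is "return NA when a group has fewer than three observations"; that rule is a guess from the sample data, and the threshold min_n and the names dt and res are hypothetical. The sketch also aggregates to one row per crisis_t, which matches the shape of df2 in the question.

```r
library(data.table)

# Sample data from the question
dt <- data.table(
  country  = c("AT","AT","AT","AT","BE","BE","BE","BE","DE","DE","DE"),
  crisis_t = c(-1,0,1,2,-1,0,1,2,0,1,2),
  value1   = c(0.01,0.02,0.015,0.03,0.5,0.55,0.7,0.4,0.01,0.02,0.04)
)

# Assumption: percentiles are only reported for groups with at least
# this many observations (hypothetical threshold, not from the answer)
min_n <- 3

# One row per crisis_t; .N is the number of rows in the current group
res <- dt[, .(
  mean_t   = mean(value1),
  median_t = quantile(value1, 0.50, type = 2, names = FALSE),
  perc_25  = ifelse(.N >= min_n,
                    quantile(value1, 0.25, type = 2, names = FALSE),
                    NA_real_),
  perc_75  = ifelse(.N >= min_n,
                    quantile(value1, 0.75, type = 2, names = FALSE),
                    NA_real_)
), by = crisis_t]

print(res)
```

With the sample data, only the crisis_t = -1 group has two observations, so it is the only group whose percentiles become NA.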

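By the way, you do not need to reassign the df2_try data.table after every modification: the changes you are making happen in place (they modify the object itself). So you can do it the way I did in the example. I used data.table's copy function so that the original data.table df1 and the modified version df2_try are separate objects.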
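The in-place behavior can be checked with a toy table. This is only an illustration; dt, alias, and snapshot are made-up names and the data is not from the question.

```r
library(data.table)

dt <- data.table(g = c("a", "a", "b"), x = c(1, 2, 3))

alias    <- dt        # plain assignment: another name for the same object
snapshot <- copy(dt)  # copy(): an independent data.table

# := adds the column by reference, with no reassignment needed
dt[, x_mean := mean(x), by = g]

print("x_mean" %in% names(alias))     # TRUE: the alias sees the new column
print("x_mean" %in% names(snapshot))  # FALSE: the copy is unchanged
```

This is why the question's pattern of assigning the result of each := call back to df2_try also changed df1 itself.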

