Compute mean and standard deviation by group for multiple variables in a data.frame
Edit: This question was originally titled "Long to wide data reshaping in R".
I'm just learning R and trying to find ways to apply it to help out others in my life. As a test case, I'm working on reshaping some data, and I'm having trouble following the examples I've found online. What I'm starting with looks like this:
ID Obs 1 Obs 2 Obs 3
1 43 48 37
1 27 29 22
1 36 32 40
2 33 38 36
2 29 32 27
2 32 31 35
2 25 28 24
3 45 47 42
3 38 40 36
And what I want to end up with will look like this:
ID Obs 1 mean Obs 1 std dev Obs 2 mean Obs 2 std dev
1 x x x x
2 x x x x
3 x x x x
And so forth. What I'm unsure of is whether I need additional information in my long-form data. I imagine that the math part (finding the means and standard deviations) will be the easy part, but I haven't been able to find a way to reshape the data correctly so that I can start on that process.
Thanks very much for any help.
This is an aggregation problem, not a reshaping problem as the question originally suggested: we wish to aggregate each column into a mean and standard deviation by ID. There are many packages that handle such problems. In base R it can be done using aggregate like this (assuming DF is the input data frame):
ag <- aggregate(. ~ ID, DF, function(x) c(mean = mean(x), sd = sd(x)))
Note 1: A commenter pointed out that ag is a data frame in which some columns are matrices. Although that may seem strange at first, it actually simplifies access. ag has the same number of columns as the input DF. Its first column ag[[1]] is ID, and the ith column of the remainder, ag[[i+1]] (or equivalently ag[-1][[i]]), is the matrix of statistics for the ith input observation column. The jth statistic of the ith observation column is therefore ag[[i+1]][, j], which can also be written as ag[-1][[i]][, j].
On the other hand, suppose there are k statistic columns for each observation column in the input (k=2 in the question). If we flatten the output, then to access the jth statistic of the ith observation column we must use the more complex ag[[k*(i-1)+j+1]] or equivalently ag[-1][[k*(i-1)+j]], where ag here denotes the flattened data frame.
For example, compare the simplicity of the first expression with the second:
ag[-1][[2]]
## mean sd
## [1,] 36.333 10.2144
## [2,] 32.250 4.1932
## [3,] 43.500 4.9497
ag_flat <- do.call("data.frame", ag) # flatten
ag_flat[-1][, 2 * (2-1) + 1:2]
## Obs_2.mean Obs_2.sd
## 1 36.333 10.2144
## 2 32.250 4.1932
## 3 43.500 4.9497
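As a quick check of the two index formulas in Note 1, here is a minimal base R sketch. It rebuilds the question's data inline, and the particular values of i and j are arbitrary choices made only for illustration.

```r
# Input data from the question (same values as in Note 2).
DF <- data.frame(
  ID    = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
  Obs_1 = c(43, 27, 36, 33, 29, 32, 25, 45, 38),
  Obs_2 = c(48, 29, 32, 38, 32, 31, 28, 47, 40),
  Obs_3 = c(37, 22, 40, 36, 27, 35, 24, 42, 36)
)

ag <- aggregate(. ~ ID, DF, function(x) c(mean = mean(x), sd = sd(x)))
ag_flat <- do.call("data.frame", ag)  # one plain column per statistic

k <- 2  # statistics per observation column (mean, sd)
i <- 2  # illustrative choice: second observation column (Obs_2)
j <- 2  # illustrative choice: second statistic (sd)

nested <- ag[[i + 1]][, j]                 # matrix column i, then statistic j
flat   <- ag_flat[[k * (i - 1) + j + 1]]   # same value by position after flattening

# Both routes should give the per-ID standard deviation of Obs_2.
stopifnot(isTRUE(all.equal(unname(nested), flat)))
```

The nested form only needs i and j, while the flattened form also needs k, which is the point the note is making.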
Note 2: The input in reproducible form is:
Lines <- "ID Obs_1 Obs_2 Obs_3
1 43 48 37
1 27 29 22
1 36 32 40
2 33 38 36
2 29 32 27
2 32 31 35
2 25 28 24
3 45 47 42
3 38 40 36"
DF <- read.table(text = Lines, header = TRUE)
There are a few different ways to go about it. reshape2 is a helpful package. Personally, I like using data.table. Below is a step-by-step walkthrough.
If myDF is your data.frame:
library(data.table)
DT <- data.table(myDF)
DT
# this will get you your mean and SD's for each column
DT[, sapply(.SD, function(x) list(mean=mean(x), sd=sd(x)))]
# adding a `by` argument will give you the groupings
DT[, sapply(.SD, function(x) list(mean=mean(x), sd=sd(x))), by=ID]
# If you would like to round the values:
DT[, sapply(.SD, function(x) list(mean=round(mean(x), 3), sd=round(sd(x), 3))), by=ID]
# If we want to add names to the columns
wide <- setnames(DT[, sapply(.SD, function(x) list(mean=round(mean(x), 3), sd=round(sd(x), 3))), by=ID], c("ID", sapply(names(DT)[-1], paste0, c(".mean", ".SD"))))
wide
   ID Obs.1.mean Obs.1.SD Obs.2.mean Obs.2.SD Obs.3.mean Obs.3.SD
1:  1     35.333    8.021     36.333   10.214       33.0    9.644
2:  2     29.750    3.594     32.250    4.193       30.5    5.916
3:  3     41.500    4.950     43.500    4.950       39.0    4.243
Also, this may or may not be helpful:
> DT[, sapply(.SD, summary), .SDcols=names(DT)[-1]]
Obs.1 Obs.2 Obs.3
Min. 25.00 28.00 22.00
1st Qu. 29.00 31.00 27.00
Median 33.00 32.00 36.00
Mean 34.22 36.11 33.22
3rd Qu. 38.00 40.00 37.00
Max. 45.00 48.00 42.00
Here is probably the simplest way to go about it (with a reproducible example):
library(plyr)
df <- data.frame(ID=rep(1:3, 3), Obs_1=rnorm(9), Obs_2=rnorm(9), Obs_3=rnorm(9))
ddply(df, .(ID), summarize, Obs_1_mean=mean(Obs_1), Obs_1_std_dev=sd(Obs_1),
Obs_2_mean=mean(Obs_2), Obs_2_std_dev=sd(Obs_2))
ID Obs_1_mean Obs_1_std_dev Obs_2_mean Obs_2_std_dev
1 1 -0.13994642 0.8258445 -0.15186380 0.4251405
2 2 1.49982393 0.2282299 0.50816036 0.5812907
3 3 -0.09269806 0.6115075 -0.01943867 1.3348792
EDIT: The following approach saves you a lot of typing when dealing with many columns.
ddply(df, .(ID), colwise(mean))
ID Obs_1 Obs_2 Obs_3
1 1 -0.3748831 0.1787371 1.0749142
2 2 -1.0363973 0.0157575 -0.8826969
3 3 1.0721708 -1.1339571 -0.5983944
ddply(df, .(ID), colwise(sd))
ID Obs_1 Obs_2 Obs_3
1 1 0.8732498 0.4853133 0.5945867
2 2 0.2978193 1.0451626 0.5235572
3 3 0.4796820 0.7563216 1.4404602
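The two colwise calls produce separate tables. If a single wide table is wanted, one possible way to combine them is base R's merge with suffixes, sketched below. The seed and the suffix strings are assumptions added here for repeatability and naming; they are not part of the original answer.

```r
library(plyr)

set.seed(1)  # assumption: fixed seed so the example is repeatable
df <- data.frame(ID = rep(1:3, 3), Obs_1 = rnorm(9), Obs_2 = rnorm(9), Obs_3 = rnorm(9))

means <- ddply(df, .(ID), colwise(mean))  # per-ID mean of every column
sds   <- ddply(df, .(ID), colwise(sd))    # per-ID sd of every column

# Join on ID; the suffixes disambiguate the identically named columns.
both <- merge(means, sds, by = "ID", suffixes = c("_mean", "_sd"))
names(both)
```

The result lists all mean columns first and then all sd columns, so reorder the columns afterwards if you prefer them interleaved.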
Here is a dplyr solution.
set.seed(1)
df <- data.frame(ID=rep(1:3, 3), Obs_1=rnorm(9), Obs_2=rnorm(9), Obs_3=rnorm(9))
library(dplyr)
df %>% group_by(ID) %>% summarise_each(funs(mean, sd))
# ID Obs_1_mean Obs_2_mean Obs_3_mean Obs_1_sd Obs_2_sd Obs_3_sd
# (int) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
# 1 1 0.4854187 -0.3238542 0.7410611 1.1108687 0.2885969 0.1067961
# 2 2 0.4171586 -0.2397030 0.2041125 0.2875411 1.8732682 0.3438338
# 3 3 -0.3601052 0.8195368 -0.4087233 0.8105370 0.3829833 1.4705692
Here's another take on the data.table answers, using @Carson's data, that's a bit more readable (and also a little faster, because it uses lapply instead of sapply):
library(data.table)
set.seed(1)
dt = data.table(ID=c(1:3), Obs_1=rnorm(9), Obs_2=rnorm(9), Obs_3=rnorm(9))
dt[, c(mean = lapply(.SD, mean), sd = lapply(.SD, sd)), by = ID]
# ID mean.Obs_1 mean.Obs_2 mean.Obs_3 sd.Obs_1 sd.Obs_2 sd.Obs_3
#1: 1 0.4854187 -0.3238542 0.7410611 1.1108687 0.2885969 0.1067961
#2: 2 0.4171586 -0.2397030 0.2041125 0.2875411 1.8732682 0.3438338
#3: 3 -0.3601052 0.8195368 -0.4087233 0.8105370 0.3829833 1.4705692
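If only some of the observation columns are needed, the same lapply pattern can be restricted with .SDcols. This is a small sketch on the same simulated data; the choice of Obs_1 and Obs_2 is arbitrary and only for illustration.

```r
library(data.table)

set.seed(1)
dt <- data.table(ID = rep(1:3, 3), Obs_1 = rnorm(9), Obs_2 = rnorm(9), Obs_3 = rnorm(9))

# Illustrative subset of columns to summarise.
cols <- c("Obs_1", "Obs_2")

# .SDcols limits .SD to the listed columns; the grouping column is excluded automatically.
res <- dt[, c(mean = lapply(.SD, mean), sd = lapply(.SD, sd)), by = ID, .SDcols = cols]
res
```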
The updated dplyr solution, as of 2020:
1: summarise_each_() is deprecated as of dplyr 0.7.0, and 2: funs() is deprecated as of dplyr 0.8.0.
library(dplyr)
ag.dplyr <- DF %>% group_by(ID) %>% summarise(across(.cols = everything(), list(mean = mean, sd = sd)))
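With a named list of functions, across() names the output columns as column name, underscore, function name by default. The sketch below uses the .names argument to control that pattern instead. It assumes dplyr 1.0.0 or later, rebuilds the question's data inline, and the "{.col}.{.fn}" pattern is just one possible choice.

```r
library(dplyr)

# Input data from the question.
DF <- data.frame(
  ID    = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
  Obs_1 = c(43, 27, 36, 33, 29, 32, 25, 45, 38),
  Obs_2 = c(48, 29, 32, 38, 32, 31, 28, 47, 40),
  Obs_3 = c(37, 22, 40, 36, 27, 35, 24, 42, 36)
)

ag_named <- DF %>%
  group_by(ID) %>%
  summarise(across(everything(), list(mean = mean, sd = sd),
                   .names = "{.col}.{.fn}"))  # e.g. Obs_1.mean, Obs_1.sd

names(ag_named)
```

Inside a grouped summarise, everything() skips the grouping column, so ID is not summarised.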
There is a helpful function in the psych package. You could try the following call:
psych::describeBy(data$dependentvariable, group = data$groupingvariable)