Compute mean and standard deviation by group for multiple variables in a data.frame

Edit -- This question was originally titled << Long to wide data reshaping in R >>


I'm just learning R and trying to find ways to apply it to help out others in my life. As a test case, I'm working on reshaping some data, and I'm having trouble following the examples I've found online. What I'm starting with looks like this:

ID  Obs 1   Obs 2   Obs 3
1   43      48      37
1   27      29      22
1   36      32      40
2   33      38      36
2   29      32      27
2   32      31      35
2   25      28      24
3   45      47      42
3   38      40      36

And what I want to end up with will look like this:

ID  Obs 1 mean  Obs 1 std dev   Obs 2 mean  Obs 2 std dev
1   x           x               x           x
2   x           x               x           x
3   x           x               x           x

And so forth. What I'm unsure of is whether I need additional information in my long-form data, or something else. I imagine that the math part (finding the means and standard deviations) will be the easy part, but I haven't been able to find a way to reshape the data correctly so that I can start on that process.

Thanks very much for any help.

This is an aggregation problem, not a reshaping problem as the question originally suggested -- we wish to aggregate each column into a mean and standard deviation by ID. There are many packages that handle such problems. In base R it can be done using aggregate like this (assuming DF is the input data frame):

ag <- aggregate(. ~ ID, DF, function(x) c(mean = mean(x), sd = sd(x)))

Note 1: A commenter pointed out that ag is a data frame for which some columns are matrices. Although initially that may seem strange, in fact it simplifies access. ag has the same number of columns as the input DF. Its first column ag[[1]] is ID, and the ith column of the remainder, ag[[i+1]] (or equivalently ag[-1][[i]]), is the matrix of statistics for the ith input observation column. If one wishes to access the jth statistic of the ith observation column, it is therefore ag[[i+1]][, j], which can also be written as ag[-1][[i]][, j].

On the other hand, suppose there are k statistic columns for each observation column in the input (k = 2 in the question). If we flatten the output, then to access the jth statistic of the ith observation column we must use the more complex ag_flat[[k*(i-1)+j+1]] or equivalently ag_flat[-1][[k*(i-1)+j]], where ag_flat is the flattened data frame.

For example, compare the simplicity of the first expression vs. the second:

ag[-1][[2]]
##        mean      sd
## [1,] 36.333 10.2144
## [2,] 32.250  4.1932
## [3,] 43.500  4.9497

ag_flat <- do.call("data.frame", ag) # flatten
ag_flat[-1][, 2 * (2-1) + 1:2]
##   Obs_2.mean Obs_2.sd
## 1     36.333  10.2144
## 2     32.250   4.1932
## 3     43.500   4.9497
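
As a rough check of the index arithmetic in Note 1, the sketch below rebuilds ag and ag_flat from the question's data and confirms that the matrix-column form and the flattened form pick out the same values. The helper flat_index is a hypothetical name introduced only for this illustration.

```r
# Question's data in reproducible form
Lines <- "ID Obs_1 Obs_2 Obs_3
1 43 48 37
1 27 29 22
1 36 32 40
2 33 38 36
2 29 32 27
2 32 31 35
2 25 28 24
3 45 47 42
3 38 40 36"
DF <- read.table(text = Lines, header = TRUE)

# Aggregated result with matrix columns, and its flattened counterpart
ag <- aggregate(. ~ ID, DF, function(x) c(mean = mean(x), sd = sd(x)))
ag_flat <- do.call("data.frame", ag)

# Hypothetical helper: position of statistic j of observation column i
# in the flattened data frame, given k statistics per column
flat_index <- function(i, j, k) k * (i - 1) + j + 1

k <- 2
checks <- c()
for (i in 1:3) {
  for (j in 1:k) {
    same <- isTRUE(all.equal(ag[[i + 1]][, j],
                             ag_flat[[flat_index(i, j, k)]],
                             check.attributes = FALSE))
    checks <- c(checks, same)
  }
}
all(checks)  # TRUE when both access paths agree for every (i, j)
```

The matrix-column form needs only i and j, while the flattened form also needs k, which is the trade-off the note describes.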

Note 2: The input in reproducible form is:

Lines <- "ID  Obs_1   Obs_2   Obs_3
1   43      48      37
1   27      29      22
1   36      32      40
2   33      38      36
2   29      32      27
2   32      31      35
2   25      28      24
3   45      47      42
3   38      40      36"
DF <- read.table(text = Lines, header = TRUE)

There are a few different ways to go about it. reshape2 is a helpful package. Personally, I like using data.table.

Below is a step-by-step walkthrough.

If myDF is your data.frame:

library(data.table)
DT <- data.table(myDF)

DT

# this will get you your mean and SD's for each column
DT[, sapply(.SD, function(x) list(mean=mean(x), sd=sd(x)))]

# adding a `by` argument will give you the groupings
DT[, sapply(.SD, function(x) list(mean=mean(x), sd=sd(x))), by=ID]

# If you would like to round the values: 
DT[, sapply(.SD, function(x) list(mean=round(mean(x), 3), sd=round(sd(x), 3))), by=ID]

# If we want to add names to the columns 
wide <- setnames(
  DT[, sapply(.SD, function(x) list(mean=round(mean(x), 3), sd=round(sd(x), 3))), by=ID],
  c("ID", sapply(names(DT)[-1], paste0, c(".mean", ".SD")))
)

wide

   ID Obs.1.mean Obs.1.SD Obs.2.mean Obs.2.SD Obs.3.mean Obs.3.SD
1:  1     35.333    8.021     36.333   10.214       33.0    9.644
2:  2     29.750    3.594     32.250    4.193       30.5    5.916
3:  3     41.500    4.950     43.500    4.950       39.0    4.243

Also, this may or may not be helpful:

> DT[, sapply(.SD, summary), .SDcols=names(DT)[-1]]
        Obs.1 Obs.2 Obs.3
Min.    25.00 28.00 22.00
1st Qu. 29.00 31.00 27.00
Median  33.00 32.00 36.00
Mean    34.22 36.11 33.22
3rd Qu. 38.00 40.00 37.00
Max.    45.00 48.00 42.00

Here is probably the simplest way to go about it (with a reproducible example):

library(plyr)
df <- data.frame(ID=rep(1:3, 3), Obs_1=rnorm(9), Obs_2=rnorm(9), Obs_3=rnorm(9))
ddply(df, .(ID), summarize, Obs_1_mean=mean(Obs_1), Obs_1_std_dev=sd(Obs_1),
  Obs_2_mean=mean(Obs_2), Obs_2_std_dev=sd(Obs_2))

   ID  Obs_1_mean Obs_1_std_dev  Obs_2_mean Obs_2_std_dev
1  1 -0.13994642     0.8258445 -0.15186380     0.4251405
2  2  1.49982393     0.2282299  0.50816036     0.5812907
3  3 -0.09269806     0.6115075 -0.01943867     1.3348792

EDIT: The following approach saves you a lot of typing when dealing with many columns.

ddply(df, .(ID), colwise(mean))

  ID      Obs_1      Obs_2      Obs_3
1  1 -0.3748831  0.1787371  1.0749142
2  2 -1.0363973  0.0157575 -0.8826969
3  3  1.0721708 -1.1339571 -0.5983944

ddply(df, .(ID), colwise(sd))

  ID     Obs_1     Obs_2     Obs_3
1  1 0.8732498 0.4853133 0.5945867
2  2 0.2978193 1.0451626 0.5235572
3  3 0.4796820 0.7563216 1.4404602
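
If a single wide table is wanted, one possible way to combine the two colwise results is base R's merge, sketched below. The variable names means, sds and both, and the suffixes chosen, are assumptions made for illustration; set.seed is added only so the sketch is repeatable.

```r
library(plyr)

set.seed(1)
df <- data.frame(ID = rep(1:3, 3), Obs_1 = rnorm(9), Obs_2 = rnorm(9), Obs_3 = rnorm(9))

# One table of per-group means and one of per-group standard deviations
means <- ddply(df, .(ID), colwise(mean))
sds   <- ddply(df, .(ID), colwise(sd))

# Join on ID; the suffixes disambiguate the identically named columns
both <- merge(means, sds, by = "ID", suffixes = c("_mean", "_std_dev"))
both
```

Note that merge places all the mean columns before all the standard deviation columns, so the columns would need reordering to get the interleaved layout shown in the question.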

Here is a dplyr solution.

set.seed(1)
df <- data.frame(ID=rep(1:3, 3), Obs_1=rnorm(9), Obs_2=rnorm(9), Obs_3=rnorm(9))

library(dplyr)
df %>% group_by(ID) %>% summarise_each(funs(mean, sd))

#      ID Obs_1_mean Obs_2_mean Obs_3_mean  Obs_1_sd  Obs_2_sd  Obs_3_sd
#   (int)      (dbl)      (dbl)      (dbl)     (dbl)     (dbl)     (dbl)
# 1     1  0.4854187 -0.3238542  0.7410611 1.1108687 0.2885969 0.1067961
# 2     2  0.4171586 -0.2397030  0.2041125 0.2875411 1.8732682 0.3438338
# 3     3 -0.3601052  0.8195368 -0.4087233 0.8105370 0.3829833 1.4705692

Here's another take on the data.table answers, using @Carson's data, that's a bit more readable (and also a little faster, because it uses lapply instead of sapply):

library(data.table)
set.seed(1)
dt = data.table(ID=c(1:3), Obs_1=rnorm(9), Obs_2=rnorm(9), Obs_3=rnorm(9))

dt[, c(mean = lapply(.SD, mean), sd = lapply(.SD, sd)), by = ID]
#   ID mean.Obs_1 mean.Obs_2 mean.Obs_3  sd.Obs_1  sd.Obs_2  sd.Obs_3
#1:  1  0.4854187 -0.3238542  0.7410611 1.1108687 0.2885969 0.1067961
#2:  2  0.4171586 -0.2397030  0.2041125 0.2875411 1.8732682 0.3438338
#3:  3 -0.3601052  0.8195368 -0.4087233 0.8105370 0.3829833 1.4705692
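
If the mean and sd of each variable should sit next to each other, as in the layout the question asks for, a variant along the following lines may work. It is only a sketch: the helper name stats and the result name res are hypothetical, and it relies on unlist(..., recursive = FALSE) flattening a list of lists into names such as Obs_1.mean.

```r
library(data.table)

set.seed(1)
dt <- data.table(ID = rep(1:3, 3), Obs_1 = rnorm(9), Obs_2 = rnorm(9), Obs_3 = rnorm(9))

# Hypothetical helper returning both statistics for one column
stats <- function(x) list(mean = mean(x), sd = sd(x))

# lapply gives a list of lists per group; flattening one level yields
# one column per (variable, statistic) pair, grouped by variable
res <- dt[, unlist(lapply(.SD, stats), recursive = FALSE), by = ID]
names(res)
```

This keeps the per-variable grouping of columns without any manual renaming.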

An updated dplyr solution, as of 2020

1: summarise_each() and summarise_each_() are deprecated as of dplyr 0.7.0, and 2: funs() is deprecated as of dplyr 0.8.0.

ag.dplyr <- DF %>% group_by(ID) %>% summarise(across(.cols = everything(),list(mean = mean, sd = sd)))
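
For a self-contained version, the sketch below runs the same across() call on the question's data and uses the .names argument to control the output column names. It assumes dplyr 1.0.0 or later; the "{.col}.{.fn}" pattern is just one choice, picked here to mimic the names produced by flattening the aggregate() result.

```r
library(dplyr)

# Question's data in reproducible form
Lines <- "ID Obs_1 Obs_2 Obs_3
1 43 48 37
1 27 29 22
1 36 32 40
2 33 38 36
2 29 32 27
2 32 31 35
2 25 28 24
3 45 47 42
3 38 40 36"
DF <- read.table(text = Lines, header = TRUE)

# everything() skips the grouping column inside across();
# .names builds each output name from the column and function names
ag.dplyr <- DF %>%
  group_by(ID) %>%
  summarise(across(everything(),
                   list(mean = mean, sd = sd),
                   .names = "{.col}.{.fn}"))
ag.dplyr
```

Leaving .names out gives the default pattern, which joins the column and function names with an underscore.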

There is a helpful function in the psych package.

You should try the following implementation:

psych::describeBy(data$dependentvariable, group = data$groupingvariable)
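
Applied to the question's data, that call might look like the sketch below. It assumes the psych package is installed and uses describeBy's mat = TRUE option to get one table instead of a list per group; the result name stats_tbl is hypothetical. describeBy reports many statistics, so the last line keeps only the ones asked about.

```r
library(psych)

# Question's data in reproducible form
Lines <- "ID Obs_1 Obs_2 Obs_3
1 43 48 37
1 27 29 22
1 36 32 40
2 33 38 36
2 29 32 27
2 32 31 35
2 25 28 24
3 45 47 42
3 38 40 36"
DF <- read.table(text = Lines, header = TRUE)

# Describe every observation column within each ID group;
# mat = TRUE requests a single table with one row per (variable, group)
stats_tbl <- describeBy(DF[-1], group = DF$ID, mat = TRUE)

# Keep only the group label, sample size, mean and standard deviation
stats_tbl[, c("group1", "n", "mean", "sd")]
```

The result is in long form (one row per variable and group), so it would still need reshaping to match the wide layout in the question.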

