
R - Collapse rows and sum the values in the column

I have the following data frame (df1):

ID   someText  PSM  OtherValues
ABC  c         2    qwe
CCC  v         3    wer
DDD  b         56   ert
EEE  m         78   yu
FFF  sw        1    io
GGG  e         90   gv
CCC  r         34   scf
CCC  t         21   fvb
KOO  y         45   hffd
EEE  u         2    asd
LLL  i         4    dlm
ZZZ  i         8    zzas
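To make the examples reproducible, the table above can be typed in as a data frame. This is only a sketch: the values are copied from the question, and stringsAsFactors = FALSE is an assumption added so that ID stays a character column on older versions of R.

```r
# Recreate the example data from the question.
df1 <- data.frame(
  ID          = c("ABC", "CCC", "DDD", "EEE", "FFF", "GGG",
                  "CCC", "CCC", "KOO", "EEE", "LLL", "ZZZ"),
  someText    = c("c", "v", "b", "m", "sw", "e",
                  "r", "t", "y", "u", "i", "i"),
  PSM         = c(2, 3, 56, 78, 1, 90, 34, 21, 45, 2, 4, 8),
  OtherValues = c("qwe", "wer", "ert", "yu", "io", "gv",
                  "scf", "fvb", "hffd", "asd", "dlm", "zzas"),
  stringsAsFactors = FALSE  # assumption: keep text columns as character
)

# Inspect the column types and the first few values.
str(df1)
```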

I would like to collapse the rows by the first column (ID) and sum the corresponding PSM values, to get the following output:

ID  Sum PSM
ABC 2
CCC 58
DDD 56
EEE 80
FFF 1
GGG 90
KOO 45
LLL 4
ZZZ 8

This seems doable with the aggregate function, but I don't know the syntax. Any help is really appreciated! Thanks.

In base R:

aggregate(PSM ~ ID, data = df1, FUN = sum)
##    ID PSM
## 1 ABC   2
## 2 CCC  58
## 3 DDD  56
## 4 EEE  80
## 5 FFF   1
## 6 GGG  90
## 7 KOO  45
## 8 LLL   4
## 9 ZZZ   8
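The question asks for the summed column to carry its own name. One way to get that with base R only is to rename the column after aggregating. In this sketch the name Sum_PSM is an illustrative choice, and df1 is cut down to the two columns that matter.

```r
# Example data from the question, reduced to the grouping and value columns.
df1 <- data.frame(
  ID  = c("ABC", "CCC", "DDD", "EEE", "FFF", "GGG",
          "CCC", "CCC", "KOO", "EEE", "LLL", "ZZZ"),
  PSM = c(2, 3, 56, 78, 1, 90, 34, 21, 45, 2, 4, 8)
)

# Sum PSM within each ID, then rename the result column.
res <- aggregate(PSM ~ ID, data = df1, FUN = sum)
names(res)[names(res) == "PSM"] <- "Sum_PSM"  # illustrative name
res
```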

Example using dplyr, the next iteration of plyr:

library(dplyr)
df2 <- df1 %>% group_by(ID) %>%
     summarize(Sum_PSM = sum(PSM))

The %>% operator is a pipe: it takes the result of the expression on its left and passes it as the first argument to the function on its right.
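As a rough illustration of what the pipe does, the piped call can be compared with the equivalent nested call. The data here is a small made-up subset of the question's table, and the names piped and nested are only for illustration.

```r
library(dplyr)

# A small subset of the question's data, enough to show grouping.
df1 <- data.frame(
  ID  = c("ABC", "CCC", "CCC", "EEE", "EEE"),
  PSM = c(2, 3, 34, 78, 2)
)

# Piped form: read left to right.
piped <- df1 %>% group_by(ID) %>% summarize(Sum_PSM = sum(PSM))

# Nested form: the same calls, read from the inside out.
nested <- summarize(group_by(df1, ID), Sum_PSM = sum(PSM))

# Both forms should produce the same table.
all.equal(as.data.frame(piped), as.data.frame(nested))
```

If plyr is loaded after dplyr in the same session, its summarize masks dplyr's and ignores the grouping, so writing dplyr::summarize explicitly is the safer choice there.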

This is super easy using the plyr package:

library(plyr)
ddply(df1, .(ID), summarize, Sum=sum(PSM))

Using the aggregate function seems better than dplyr if you just want to keep the original column names and operate on one column at a time. It also avoids the summarize function.

Note from the summarize function documentation:

Be careful when using existing variable names; the corresponding columns will be immediately updated with the new data and this can affect subsequent operations referring to those variables.

For instance:

## modified example from aggregate documentation with character variables and NAs
testDF <- data.frame(v1 = c(1,3,5,7,8,3,5,NA,4,5,7,9),
                 v2 = c(11,33,55,77,88,33,55,NA,44,55,77,99) )
by1 <- c("red", "blue", 1, 2, NA, "big", 1, 2, "red", 1, NA, 12)

aggregate(x = testDF, by = list(by1), FUN = "sum")
Group.1 v1  v2
1       1 15 165
2      12  9  99
3       2 NA  NA
4     big  3  33
5    blue  3  33
6     red  5  55

That gives you what you want, whereas with summarize and ddply you have to name each output column yourself. So if you have many columns, aggregate seems more convenient.

testDF$ID=by1
ddply(testDF, .(ID), summarize, v1=sum(v1), v2=sum(v2) )
ID v1  v2
1    1 15 165
2   12  9  99
3    2 NA  NA
4  big  3  33
5 blue  3  33
6  red  5  55
7 <NA> 15 165
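The point about many columns can also be sketched with aggregate's formula interface, where a dot on the left-hand side stands for every column not named on the right, so no column has to be listed by hand. The data frame wideDF and its values are hypothetical. Note that the formula interface drops rows containing NA by default (na.action = na.omit), which differs from the by = form used above.

```r
# Hypothetical data frame with several numeric columns and one grouping column.
wideDF <- data.frame(
  ID = c("a", "b", "a", "b", "c"),
  v1 = c(1, 2, 3, 4, 5),
  v2 = c(10, 20, 30, 40, 50),
  v3 = c(100, 200, 300, 400, 500)
)

# "." means "all other columns"; each is summed within ID,
# and the original column names are kept.
sums <- aggregate(. ~ ID, data = wideDF, FUN = sum)
sums
```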

To see the effect of summarize immediately updating the columns, compare the following examples:

ddply(testDF, .(ID), summarize, v1=max(v1,v2), v2=min(v1,v2) )
ID v1 v2
1    1 55 55
2   12 99 99
3    2 NA NA
4  big 33 33
5 blue 33 33
6  red 44 11
7 <NA> 88 77

ddply(testDF, .(ID), summarize, v1=min(v1,v2), v2=min(v1,v2) )
ID v1 v2
1    1  5  5
2   12  9  9
3    2 NA NA
4  big  3  3
5 blue  3  3
6  red  1  1
7 <NA>  7  7

Note that when v1 uses max, the column has already been updated by the time v2 is calculated, so for ID = 1, for instance, min in v2 cannot return 5.
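One way to sidestep the immediate update, sketched here under the assumption that new column names are acceptable, is to give the outputs names that do not already exist in the data. The names hi and lo are illustrative. Both expressions then see the original v1 and v2.

```r
library(plyr)

# Same test data as above, with the grouping vector stored as ID.
testDF <- data.frame(v1 = c(1, 3, 5, 7, 8, 3, 5, NA, 4, 5, 7, 9),
                     v2 = c(11, 33, 55, 77, 88, 33, 55, NA, 44, 55, 77, 99))
testDF$ID <- c("red", "blue", 1, 2, NA, "big", 1, 2, "red", 1, NA, 12)

# hi and lo are new names, so v1 and v2 are not overwritten
# between the two expressions.
res <- ddply(testDF, .(ID), summarize, hi = max(v1, v2), lo = min(v1, v2))
res
```

With this naming, the ID = 1 group keeps its true minimum of 5 in lo alongside the maximum of 55 in hi.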

Using data.table:

library(data.table)
setDT(df1)[,  lapply(.SD, sum) , by = ID, .SDcols = "PSM" ]
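For a single column, a shorter data.table form names the result directly in j. This is a sketch rather than the answer's own code: Sum_PSM is an illustrative name, and the table is built with data.table() here instead of converting df1 in place with setDT(), which modifies the original object by reference.

```r
library(data.table)

# Example data from the question, reduced to the two relevant columns.
dt <- data.table(
  ID  = c("ABC", "CCC", "DDD", "EEE", "FFF", "GGG",
          "CCC", "CCC", "KOO", "EEE", "LLL", "ZZZ"),
  PSM = c(2, 3, 56, 78, 1, 90, 34, 21, 45, 2, 4, 8)
)

# Sum PSM within each ID and name the output column in one step.
res <- dt[, .(Sum_PSM = sum(PSM)), by = ID]
res
```

Groups come back in order of first appearance in the data, not sorted alphabetically; use keyby = ID instead of by = ID if sorted output is wanted.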
