简体   繁体   中英

R- Collapse rows and sum the values in the column

I have the following dataframe (df1):

 ID someText PSM OtherValues ABC c 2 qwe CCC v 3 wer DDD b 56 ert EEE m 78 yu FFF sw 1 io GGG e 90 gv CCC r 34 scf CCC t 21 fvb KOO y 45 hffd EEE u 2 asd LLL i 4 dlm ZZZ i 8 zzas 

I would like to collapse the first column and add the corresponding PSM values and I would like to get the following output:

ID  Sum PSM
ABC 2
CCC 58
DDD 56
EEE 80
FFF 1
GGG 90
KOO 45
LLL 4
ZZZ 8

It seems doable with aggregate function but don't know the syntax. Any help is really appreciated! Thanks.

In base:

aggregate(PSM ~ ID, data=x, FUN=sum)
##    ID PSM
## 1 ABC   2
## 2 CCC  58
## 3 DDD  56
## 4 EEE  80
## 5 FFF   1
## 6 GGG  90
## 7 KOO  45
## 8 LLL   4
## 9 ZZZ   8

Example using dplyr, the next iteration of plyr:

df2 <- df1 %>% group_by(ID) %>%
     summarize(Sum_PSM = sum(PSM))

When you put the characters %>% , you are "piping." This means you're inputting what is on the left side of that pipe operator and performing the function on the right.

This is super easy using the plyr package:

library(plyr)
ddply(df1, .(ID), summarize, Sum=sum(PSM))

Using aggregate function seems to be better than dplyr if you want to just keep the original column names and operate inside one column at a time. Avoiding the use of summarize function,

Note from summarize function documentation

Be careful when using existing variable names; the corresponding columns will be immediately updated with the new data and this can affect subsequent operations referring to those variables.

For instance

## modified example from aggregate documentation with character variables and NAs
testDF <- data.frame(v1 = c(1,3,5,7,8,3,5,NA,4,5,7,9),
                 v2 = c(11,33,55,77,88,33,55,NA,44,55,77,99) )
by <- c("red", "blue", 1, 2, NA, "big", 1, 2, "red", 1, NA, 12)

aggregate(x = testDF, by = list(by1), FUN = "sum")
Group.1 v1  v2
1       1 15 165
2      12  9  99
3       2 NA  NA
4     big  3  33
5    blue  3  33
6     red  5  55

You get what you want, but when you use summarise and ddply you need to specify names. So if you have many columns aggregate seems to be convenient.

testDF$ID=by1
ddply(testDF, .(ID), summarize, v1=sum(v1), v2=sum(v2) )
ID v1  v2
1    1 15 165
2   12  9  99
3    2 NA  NA
4  big  3  33
5 blue  3  33
6  red  5  55
7 <NA> 15 165

To see the effect of the immediate update of the columns with summarize you can check the following examples,

ddply(testDF, .(ID), summarize, v1=max(v1,v2), v2=min(v1,v2) )
ID v1 v2
1    1 55 55
2   12 99 99
3    2 NA NA
4  big 33 33
5 blue 33 33
6  red 44 11
7 <NA> 88 77

ddply(testDF, .(ID), summarize, v1=min(v1,v2), v2=min(v1,v2) )
ID v1 v2
1    1  5  5
2   12  9  9
3    2 NA NA
4  big  3  3
5 blue  3  3
6  red  1  1
7 <NA>  7  7

Note that when V1 uses max, the col is already update when calculating v2, so for instance in the case of ID=1 we can't get the number 5 when using min in v2.

使用data.table

setDT(df1)[,  lapply(.SD, sum) , by = ID, .SDcols = "PSM" ]

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM