
Loop or apply for sum of rows based on multiple conditions in R dataframe

I've hacked together a quick solution to my problem, but I have a feeling it's quite clumsy. Moreover, it uses for loops, which, from what I've gathered, should be avoided in R where possible. Any and all advice to tidy up this code is appreciated. I'm still pretty new to R, and I fear I'm making a relatively simple problem much too convoluted.

I have a dataset as follows:

id  count   group
2   6   A
2   8   A
2   6   A
8   5   A
8   6   A
8   3   A
10  6   B
10  6   B
10  6   B
11  5   B
11  6   B
11  7   B
16  6   C
16  2   C
16  0   C
18  6   C
18  1   C
18  6   C

I would like to create a new dataframe that contains, for each unique ID, the sum of the first two counts of that ID (e.g. 6+8=14 for ID 2). I also want to attach the correct group identifier.

In general you might need to do this when you measure a value on consecutive days for different subjects and treatments, and you want to compute the total for each subject for the first x days of measurement.

This is what I've come up with:

id <- c(rep(c(2,8,10,11,16,18),each=3))
count <- c(6,8,6,5,6,3,6,6,6,5,6,7,6,2,0,6,1,6)
group <- c(rep(c("A","B","C"),each=6))
df <- data.frame(id,count,group)

newid<-c()
newcount<-c()
newgroup<-c()
for (i in 1:length(unique(df$"id"))) {
  newid[i] <- unique(df$"id")[i]
  newcount[i]<-sum(df[df$"id"==unique(df$"id")[i],2][1:2])
  newgroup[i] <- as.character(df$"group"[df$"id"==newid[i]][1])
}

newdf<-data.frame(newid,newcount,newgroup)

Some possible improvements/alternatives I'm not sure about:

  • For loops vs apply functions
  • Can I create a dataframe directly inside a for loop, or should I stick to creating vectors that I can later assign to a dataframe?
  • More consistent approaches to accessing/subsetting vectors/columns ($, [], [[]], subset?)

You can pass a user-defined function to aggregate:

# Sum the first two elements of x (gives NA if x has fewer than two elements)
sum1sttwo <- function(x) {
  return(x[1] + x[2])
}
aggregate(count ~ id + group, data = df, sum1sttwo)

and the output is:

  id group count
1  2     A    14
2  8     A    11
3 10     B    12
4 11     B    11
5 16     C     8
6 18     C     7
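The question mentions wanting the total over the first x measurements in general. A minimal sketch of that, assuming a hypothetical helper named sum_first_n (not part of the original answer) built on head(), which also copes with ids that have fewer than n rows:

```r
# Rebuild the example data from the question
id <- rep(c(2, 8, 10, 11, 16, 18), each = 3)
count <- c(6, 8, 6, 5, 6, 3, 6, 6, 6, 5, 6, 7, 6, 2, 0, 6, 1, 6)
group <- rep(c("A", "B", "C"), each = 6)
df <- data.frame(id, count, group)

# Hypothetical helper: sum of the first n elements of x.
# head() returns fewer than n elements when x is short, so no NA is produced.
sum_first_n <- function(x, n = 2) sum(head(x, n))

# Same aggregate() call as above, with n made explicit
res <- aggregate(count ~ id + group, data = df,
                 FUN = function(x) sum_first_n(x, n = 2))
res
```

Changing n = 2 to another value gives the total over a different number of leading rows per id.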

04/2015 edit: dplyr and data.table are better choices when your data set is large, because grouped operations on base R data frames can become slow as the data grows. However, if you just need to aggregate a small, simple data set, the aggregate function in base R serves the purpose.

You could use dplyr:

library(dplyr)
df %>% group_by(id, group) %>% slice(1:2) %>% summarise(newcount = sum(count))

The pipe syntax makes it easy to read: group your data by id and group, take the first two rows of each group, then sum the counts.
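The same three steps can also be written with slice_head(), which in recent dplyr versions (1.0.0 and later, an assumption about your installation) states the "first n rows" intent directly. This is only a sketch of a variant, not the answerer's code; .groups = "drop" is added so the result comes back ungrouped:

```r
library(dplyr)

# Rebuild the example data from the question
df <- data.frame(
  id = rep(c(2, 8, 10, 11, 16, 18), each = 3),
  count = c(6, 8, 6, 5, 6, 3, 6, 6, 6, 5, 6, 7, 6, 2, 0, 6, 1, 6),
  group = rep(c("A", "B", "C"), each = 6)
)

res <- df %>%
  group_by(id, group) %>%        # one group per id/group pair
  slice_head(n = 2) %>%          # keep the first two rows of each group
  summarise(newcount = sum(count), .groups = "drop")  # sum them, then ungroup
res
```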

You could do this using data.table:

library(data.table)
# setDT() converts df to a data.table by reference
setDT(df)[, list(newcount = sum(count[1:2])), by = .(id, group)]
#    id group newcount
#1:  2     A       14
#2:  8     A       11
#3: 10     B       12
#4: 11     B       11
#5: 16     C        8
#6: 18     C        7

Another option is plyr, in two steps:

    library(plyr)

    # Keep the first 2 counts for each id and group (returned as columns V1 and V2)
    df2 <- ddply(df, c("id", "group"), function(x) x$count[1:2])

    # Sum the two counts for each id and group
    df3 <- ddply(df2, c("id", "group"), summarize, count = V1 + V2)

    df3
    #   id group count
    # 1  2     A    14
    # 2  8     A    11
    # 3 10     B    12
    # 4 11     B    11
    # 5 16     C     8
    # 6 18     C     7
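On the question's "for loops vs apply functions" point, here is a base R sketch that replaces the loop with tapply(), with no extra packages. The variable names (first_two, ids) are illustrative only, and this is one possible approach rather than the recommended one:

```r
# Rebuild the example data from the question
df <- data.frame(
  id = rep(c(2, 8, 10, 11, 16, 18), each = 3),
  count = c(6, 8, 6, 5, 6, 3, 6, 6, 6, 5, 6, 7, 6, 2, 0, 6, 1, 6),
  group = rep(c("A", "B", "C"), each = 6)
)

# For each id, sum the first two counts; the result is named by id
first_two <- tapply(df$count, df$id, function(x) sum(head(x, 2)))
ids <- as.numeric(names(first_two))

# Build the result in one step instead of growing vectors inside a loop;
# match() picks the group from the first row of each id
newdf <- data.frame(
  id = ids,
  count = as.vector(first_two),
  group = df$group[match(ids, df$id)]
)
newdf
```

Building the vectors first and calling data.frame() once, as here, is usually simpler than appending rows to a dataframe inside a loop.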
