Use the lapply and ddply functions

Question

I am trying to use ddply on my sample data (call it z), which has two columns, id and y.

My goal is to find the sum of y for the ids starting with 1 (i.e. 1001, 1200, ...), 2 (2100), 3 (3100, 3190), 4, ..., 10, 11, ..., 65. For example, for ids starting with 1 the sum is 10 + 11 + 12 = 33; for ids starting with 2 it is 32.

I wanted to use lapply, as follows:

>s <- split(z,z$id)
>lapply(s, function(x) colSums(x[, c("y")]))

However, this gives me the sum for each unique id, not the grouped sums I am looking for. Any suggestions would be highly appreciated.

Answer 1

Here is a data.table solution that uses %/% to perform integer division, which returns how many whole thousands each id contains:

library(data.table)
DT <- data.table(z)

x <- DT[,list(sum_y = sum(y)), by = list(id = id %/% 1000)]
x
   id sum_y
1:  1    33
2:  2    54
3:  3    23
4:  4    45
5:  5   123
6: 10    99

You could do something similar with ddply:

library(plyr)
ddply(z, .(id = id %/% 1000), summarize, sum_y = sum(y))
  id sum_y
1  1    33
2  2    54
3  3    23
4  4    45
5  5   123
6 10    99
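
To see why this grouping works, here is a small base R sketch of what %/% does on its own. The ids vector is made up for illustration (it mirrors the ids in the question), and the sketch assumes the group is always everything before the last three digits.

```r
# Hypothetical ids, mirroring the ones in the question
ids <- c(1001, 1200, 2100, 3190, 10001, 10345)

# Integer division by 1000 drops the last three digits
prefix <- ids %/% 1000
prefix
# 1 1 2 3 10 10

# The remainder (%%) gives back the digits that were dropped
suffix <- ids %% 1000
suffix
# 1 200 100 190 1 345
```

Note that any id below 1000 would land in group 0 with this approach, so it only fits data where every id has at least four digits.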

Answer 2

Does this give you the intended answer?

z <- read.table(textConnection("id y
1001 10
1001 11
1200 12
2001 10
2030 12
2100 32
3100 10
3190 13
4100 45
5100 67
5670 56
10001 54
10345 45"),header=TRUE)

result <- tapply(
                 z$y,
                 as.numeric(substr(z$id,1,nchar(z$id)-3)),
                 sum
                )

result
  1   2   3   4   5  10 
 33  54  23  45 123  99

To steal @mnel's line from above, this could be simplified to:

result <- tapply(
                 z$y,
                 z$id %/% 1000,
                 sum
                )
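
As a quick sanity check, the sketch below confirms that the two grouping keys agree on this sample data. It rebuilds z with data.frame() instead of read.table() purely for brevity; the names key_substr and key_intdiv are my own, not from the question.

```r
# Same sample data as above, with integer ids as read.table would produce
z <- data.frame(
  id = c(1001L, 1001L, 1200L, 2001L, 2030L, 2100L, 3100L,
         3190L, 4100L, 5100L, 5670L, 10001L, 10345L),
  y  = c(10, 11, 12, 10, 12, 32, 10, 13, 45, 67, 56, 54, 45)
)

# Key 1: strip the last three characters of the id
id_chr <- as.character(z$id)
key_substr <- as.numeric(substr(id_chr, 1, nchar(id_chr) - 3))

# Key 2: integer division by 1000
key_intdiv <- z$id %/% 1000

# Both keys should describe the same groups
same_keys <- isTRUE(all.equal(key_substr, key_intdiv))
same_keys

result <- tapply(z$y, key_intdiv, sum)
result
```

The integer-division key avoids the round trip through character strings, which is why it reads as the simpler of the two.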

Answer 3

thelatemail provides a valid approach, but I want to point out that the problem isn't really your understanding of lapply (your code was almost correct). The problem is how you think about the grouping. thelatemail handles this in his solution, and that's the key. I'm going to show you a fix using your approach, then how I would actually approach this, and then a version using ave, just because I never get to use it :)

Read in data

# data borrowed from thelatemail's answer
z <- read.table(textConnection("id y
1001 10
1001 11
1200 12
2001 10
2030 12
2100 32
3100 10
3190 13
4100 45
5100 67
5670 56
10001 54
10345 45"),header=TRUE)

Your code adjusted

s <- split(z, substring(as.character(z$id), 1, nchar(as.character(z$id)) - 3))
lapply(s, function(x) sum(x[, "y"]))
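
If you would rather get a named vector than a list, swapping lapply for sapply is one option. This is only a sketch: it rebuilds z with data.frame() for brevity, and the names group and sums are my own.

```r
# Same sample data as above
z <- data.frame(
  id = c(1001L, 1001L, 1200L, 2001L, 2030L, 2100L, 3100L,
         3190L, 4100L, 5100L, 5670L, 10001L, 10345L),
  y  = c(10, 11, 12, 10, 12, 32, 10, 13, 45, 67, 56, 54, 45)
)

# Group label: the id with its last three characters removed
id_chr <- as.character(z$id)
group <- substring(id_chr, 1, nchar(id_chr) - 3)

# Split by group, then sum y within each piece
s <- split(z, group)
sums <- sapply(s, function(x) sum(x[, "y"]))

# Look up a group by its label
sums[["1"]]   # 33
sums[["10"]]  # 99
```

Because the group labels are character strings, the groups are ordered as text, so "10" will typically sort before "2". Convert the labels with as.numeric first if you need numeric order.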

The approach I would likely take: add a new grouping variable

z$IDgroup <- substring(as.character(z$id), 1, nchar(as.character(z$id)) - 3)
aggregate(y ~ IDgroup, z, sum)
#similar approach but adds the solution back as a new column
z$group.sum <- ave(z$y, z$IDgroup, FUN=sum)
z
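
To make the difference between the two calls concrete, here is a sketch on a small made-up subset of the data (four rows, chosen only for illustration). aggregate returns one row per group, while ave returns one value per original row.

```r
# Hypothetical four-row subset, just to keep the output short
z <- data.frame(id = c(1001L, 1200L, 2100L, 10001L),
                y  = c(10, 12, 32, 54))

id_chr <- as.character(z$id)
z$IDgroup <- substring(id_chr, 1, nchar(id_chr) - 3)

# One row per group
agg <- aggregate(y ~ IDgroup, z, sum)
nrow(agg)
# 3

# One value per original row, repeated within each group
z$group.sum <- ave(z$y, z$IDgroup, FUN = sum)
z$group.sum
# 22 22 32 54
```

Use aggregate when you want a summary table, and ave when you want the group total attached to every row.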

3 answers

solution1
5 2012-11-12 05:16:58

solution2
3 2012-11-12 05:07:44

solution3
3 ACCEPTED 2012-11-12 05:24:01
