
Use the lapply and ddply functions

I am trying to use ddply on my sample data (call it z), which looks like this:

id    y
1001  10
1001  11
1200  12
2001  10
2030  12
2100  32
3100  10
3190  13
4100  45
5100  67
5670  56
...
10001  54
10345  45
11234  32
and so on

My goal is to find the sum of y for ids starting with 1 (i.e. 1001, 1200, ...), 2 (2001, 2030, 2100), 3 (3100, 3190), 4, ..., 10, 11, ..., 65. For example, for ids starting with 1 the sum is 10+11+12=33, and for ids starting with 2 it is 10+12+32=54.

I tried using lapply as follows:

s <- split(z, z$id)
lapply(s, function(x) colSums(x[, "y", drop = FALSE]))

However, this gives me the sum for each unique id, not the grouped sums I am looking for. Any suggestions would be highly appreciated.

Here is a data.table solution that uses %/% (integer division) to work out how many thousands each id contains:

library(data.table)
DT <- data.table(z)

x <- DT[,list(sum_y = sum(y)), by = list(id = id %/% 1000)]
x
   id sum_y
1:  1    33
2:  2    54
3:  3    23
4:  4    45
5:  5   123
6: 10    99
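To see what the grouping key looks like on its own, here is a minimal base R sketch. The vector name ids is only illustrative; its values are a subset of the ids in the sample data.

```r
# A few ids taken from the sample data in the question
ids <- c(1001, 1200, 2001, 3190, 5670, 10001, 10345)

# %/% is integer division, so dividing by 1000 drops the last three digits
key <- ids %/% 1000
key
# [1]  1  1  2  3  5 10 10
```

Every id that shares a key ends up in the same group, which is why 10001 and 10345 are summed together under 10 rather than under 1.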

You could do something similar with ddply from the plyr package:

library(plyr)
ddply(z, .(id = id %/% 1000), summarize, sum_y = sum(y))
  id sum_y
1  1    33
2  2    54
3  3    23
4  4    45
5  5   123
6 10    99

Does this give you the intended answer?

z <- read.table(textConnection("id y
1001 10
1001 11
1200 12
2001 10
2030 12
2100 32
3100 10
3190 13
4100 45
5100 67
5670 56
10001 54
10345 45"),header=TRUE)

result <- tapply(
                 z$y,
                 as.numeric(substr(z$id,1,nchar(z$id)-3)),
                 sum
                )

result
  1   2   3   4   5  10 
 33  54  23  45 123  99 

To steal @mnel's line from above, this could be simplified to:

result <- tapply(
                 z$y,
                 z$id %/% 1000,
                 sum
                )
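One caveat worth checking before swapping one key for the other: the two keys agree on the sample data, but they should differ for ids below 1000. The sketch below uses a hypothetical id of 999, which does not appear in the question.

```r
small_id <- 999  # hypothetical id, not part of the sample data

# String approach: removing three characters leaves an empty string,
# and as.numeric("") is NA, so tapply would drop this row
key_substr <- as.numeric(substr(small_id, 1, nchar(small_id) - 3))

# Integer division: the id lands in group 0 instead
key_div <- small_id %/% 1000

key_substr
key_div
```

If the real data contains ids below 1000, the %/% version keeps them (in group 0), while the substr version silently leaves them out of the result.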

thelatemail provides a valid approach, but I want to point out that the problem isn't really your understanding of lapply (your code was almost correct); it's how you think about grouping. thelatemail handles this in his solution, and that's the key. I'll show a fix to your approach first, then how I would actually approach it, and finally a version using ave, just because I never get to use it :)

Read in data

# stole this from thelatemail
z <- read.table(textConnection("id y
1001 10
1001 11
1200 12
2001 10
2030 12
2100 32
3100 10
3190 13
4100 45
5100 67
5670 56
10001 54
10345 45"),header=TRUE)

Your code adjusted

s <- split(z, substring(as.character(z$id), 1, nchar(as.character(z$id)) - 3))
lapply(s, function(x) sum(x[, "y"]))
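As a rough check that the grouping, not lapply itself, was the issue, the sketch below splits only the y column by the thousands key and compares the result with tapply. The data frame is rebuilt by hand from the sample rows so the block runs on its own; the names grp, by_split and by_tapply are illustrative.

```r
# Sample data from the question, rebuilt so this block is self-contained
z <- data.frame(
  id = c(1001, 1001, 1200, 2001, 2030, 2100, 3100,
         3190, 4100, 5100, 5670, 10001, 10345),
  y  = c(10, 11, 12, 10, 12, 32, 10, 13, 45, 67, 56, 54, 45)
)

# Group by thousands instead of by the full id
grp <- z$id %/% 1000

# split + sapply: one sum per group, returned as a named vector
by_split <- sapply(split(z$y, grp), sum)

# tapply does the split and the sum in a single call
by_tapply <- tapply(z$y, grp, sum)

by_split
all.equal(as.vector(by_split), as.vector(by_tapply))
```

Using sapply rather than lapply only changes the shape of the result (a named vector instead of a list); the sums are the same.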

The approach I would likely take: add a new grouping variable

z$IDgroup <- substring(as.character(z$id), 1, nchar(as.character(z$id)) - 3)
aggregate(y ~ IDgroup, z, sum)
#similar approach but adds the solution back as a new column
z$group.sum <- ave(z$y, z$IDgroup, FUN=sum)
z
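One thing to watch with a character key like IDgroup: aggregate should return the groups in string order, so "10" sorts before "2". A possible variation, sketched below, uses the numeric key from the earlier answers instead; the data frame is rebuilt by hand so the block is self-contained, and res is just an illustrative name.

```r
# Sample data from the question, rebuilt so this block is self-contained
z <- data.frame(
  id = c(1001, 1001, 1200, 2001, 2030, 2100, 3100,
         3190, 4100, 5100, 5670, 10001, 10345),
  y  = c(10, 11, 12, 10, 12, 32, 10, 13, 45, 67, 56, 54, 45)
)

# Numeric grouping key, so groups sort as 1, 2, ..., 10 rather than as text
z$IDgroup <- z$id %/% 1000

res <- aggregate(y ~ IDgroup, z, sum)
res
```

The sums are unchanged; only the type of the key, and therefore the row order of the result, differs.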


 