I have a very large dataset that looks simplified like this:
row. member_id entry_id comment_count timestamp
1 1 a 4 2008-06-09 12:41:00
2 1 b 1 2008-07-14 18:41:00
3 1 c 3 2008-07-17 15:40:00
4 2 d 12 2008-06-09 12:41:00
5 2 e 50 2008-09-18 10:22:00
6 3 f 0 2008-10-03 13:36:00
I now want to create new a column, in which I sum up all the "commen_count" of previous ideas ("ID") from the same member. So I only want to sum up the comment_counts from entried that occured before the current entry. I can order my dataset by the member id and the timestamp.
The result should look something like this:
row. member_id entry_id comments_count timestamp aggregated_count
1 1 a 4 2008-06-09 12:41:00 4
2 1 b 1 2008-07-14 18:41:00 5
3 1 c 3 2008-07-17 15:40:00 8
4 2 d 12 2008-06-09 12:41:00 12
5 2 e 50 2008-09-18 10:22:00 62
6 3 f 0 2008-10-03 13:36:00 0
Some idea how I can do this in R (or Stata)? I tried aggregate, but I don't understand how to sum only the comment_counts before the curren timestamp and only those with the current member_id.
Try this (assuming df
is your data)
transform(df, aggregated_count = ave(comments_count, member_id, FUN = cumsum))
# member_id entry_id comments_count timestamp aggregated_count
# 1 1 a 4 2008-06-09 12:41:00 4
# 2 1 b 1 2008-07-14 18:41:00 5
# 3 1 c 3 2008-07-17 15:40:00 8
# 4 2 d 12 2008-06-09 12:41:00 12
# 5 2 e 50 2008-09-18 10:22:00 62
# 6 3 f 0 2008-10-03 13:36:00 0
Some additional ways (introduced for efficiency gain):
library(data.table)
setDT(df)[, aggregated_count := cumsum(comments_count), member_id]
Or
library(dplyr)
df %>%
group_by(member_id) %>%
mutate(aggregated_count = cumsum(comments_count))
With Stata:
clear
set more off
*----- example data -----
input ///
row member_id str1 entry_id comment_count str30 timestamp
1 1 a 4 2008-06-09 12:41:00
2 1 b 1 2008-07-14 18:41:00
3 1 c 3 2008-07-17 15:40:00
4 2 d 12 2008-06-09 12:41:00
5 2 e 50 2008-09-18 10:22:00
6 3 f 0 2008-10-03 13:36:00
end
list
*----- what you want -----
bysort member_id: gen s = sum(comment_count)
list
This simply involves the use of the by:
prefix.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.