I am fairly new to R (new to this site as well) and trying to understand how to aggregate data across columns in a situation where there is more than 1 identifier (in this case, two: PERSON_ID and PRODUCT_ID).
Please see my example below. To the right of the two identifiers within my data frame are five columns containing weekly sales figures. I need to aggregate the weekly data so that:
1: Week columns with the same name are summed (typically this is something I can easily accomplish in MS Excel using the sumif/sumifs function)
2: Any rows containing the same PERSON_ID and PRODUCT_ID combination are summed as well.
In this particular case, notice that the week of 6/2/2017 appears in more than one column. Meanwhile, PERSON_ID 0003603 appears twice for the same PRODUCT_ID, 3024.
PERSON_ID PRODUCT_ID 6/23/2017 6/16/2017 6/9/2017 6/2/2017 6/2/2017
0003603 3024 10.000 5.000 4.000 3.000 2.000
0003603 3024 1.000 2.000 3.000 8.000 1.000
0007654 2111 8.000 3.000 2.000 1.000 0.000
0008885 3025 0.000 0.000 1.000 3.000 9.000
0950645 3024 6.000 5.000 4.000 3.000 2.000
My actual data frame contains in excess of 1 million records, so an approach using the data.table package would be ideal, as far as I can tell.
Can someone please shed some light on how to solve this particular problem in R?
Melting your data (reshaping it to long format) is the way to go. If I understand correctly what you're after, it's simply:
library(data.table)

# Read the sample data; keep both IDs as character so leading zeros survive
x = fread('PERSON_ID PRODUCT_ID 6/23/2017 6/16/2017 6/9/2017 6/2/2017 6/2/2017
0003603 3024 10.000 5.000 4.000 3.000 2.000
0003603 3024 1.000 2.000 3.000 8.000 1.000
0007654 2111 8.000 3.000 2.000 1.000 0.000
0008885 3025 0.000 0.000 1.000 3.000 9.000
0950645 3024 6.000 5.000 4.000 3.000 2.000',
          colClasses = c('character', 'character', rep('numeric', 5L)))

# Reshape to long: one row per person/product/week column
xmlt =
  melt(x, id.vars = c('PERSON_ID', 'PRODUCT_ID'),
       variable.name = 'week', value.name = 'sales')
# Parse the week labels as dates
xmlt[ , week := as.IDate(week, format = '%m/%d/%Y')]
# Sum sales within each person/product/week group
xmlt[ , .(total_sales = sum(sales)),
     keyby = .(PERSON_ID, PRODUCT_ID, week)]
#     PERSON_ID PRODUCT_ID       week total_sales
# 1: 0003603 3024 2017-06-02 14
# 2: 0003603 3024 2017-06-09 7
# 3: 0003603 3024 2017-06-16 7
# 4: 0003603 3024 2017-06-23 11
# 5: 0007654 2111 2017-06-02 1
# 6: 0007654 2111 2017-06-09 2
# 7: 0007654 2111 2017-06-16 3
# 8: 0007654 2111 2017-06-23 8
# 9: 0008885 3025 2017-06-02 12
# 10: 0008885 3025 2017-06-09 1
# 11: 0008885 3025 2017-06-16 0
# 12: 0008885 3025 2017-06-23 0
# 13: 0950645 3024 2017-06-02 5
# 14: 0950645 3024 2017-06-09 4
# 15: 0950645 3024 2017-06-16 5
# 16: 0950645 3024 2017-06-23 6
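If the wide layout is wanted again afterwards, the aggregated long table could be cast back with dcast. This is only a minimal sketch: it assumes the result above has been stored in a variable, here given the hypothetical name agg, and it rebuilds the first two groups by hand so the snippet runs on its own.

```r
library(data.table)

# Hand-built copy of the first two groups of the aggregated result
# (the name `agg` is an assumption, not part of the code above)
agg <- data.table(
  PERSON_ID   = rep(c("0003603", "0007654"), each = 4L),
  PRODUCT_ID  = rep(c("3024", "2111"), each = 4L),
  week        = rep(as.IDate(c("2017-06-02", "2017-06-09",
                               "2017-06-16", "2017-06-23")), 2L),
  total_sales = c(14, 7, 7, 11, 1, 2, 3, 8)
)

# One row per PERSON_ID/PRODUCT_ID pair, one column per week
wide <- dcast(agg, PERSON_ID + PRODUCT_ID ~ week, value.var = "total_sales")
print(wide)
```

Because the weeks are already unique after aggregation, each cell holds exactly one value and no aggregation function needs to be passed to dcast.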
We first define df as follows. Note that read.table, with its default check.names = TRUE, does not keep column names that start with a number or that are duplicated. R rectifies these by adding X at the front of column names that start with a number, replacing each / with a ., and appending .1, .2 etc. at the end of duplicated column names.
df <- read.table(text = "
PERSON_ID PRODUCT_ID 6/23/2017 6/16/2017 6/9/2017 6/2/2017 6/2/2017
0003603 3024 10.000 5.000 4.000 3.000 2.000
0003603 3024 1.000 2.000 3.000 8.000 1.000
0007654 2111 8.000 3.000 2.000 1.000 0.000
0008885 3025 0.000 0.000 1.000 3.000 9.000
0950645 3024 6.000 5.000 4.000 3.000 2.000",
header = TRUE, colClasses = rep(c("character", "numeric"), c(2,5)))
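To see what the week columns end up being called, the renaming can be reproduced on the header alone. A small sketch, assuming read.table's default check.names = TRUE, which passes the header through make.names(unique = TRUE); the vector raw_names is just an illustrative subset of the header.

```r
# Illustrative subset of the original header (hypothetical variable name)
raw_names <- c("PERSON_ID", "PRODUCT_ID", "6/23/2017", "6/2/2017", "6/2/2017")

# The same repair read.table applies when check.names = TRUE
fixed <- make.names(raw_names, unique = TRUE)
print(fixed)
# "PERSON_ID" "PRODUCT_ID" "X6.23.2017" "X6.2.2017" "X6.2.2017.1"
```

The second 6/2/2017 column therefore becomes X6.2.2017.1, which is why the date has to be extracted from the name before the two columns can be summed together.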
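We can use the dplyr (data manipulation), tidyr (tidy data) and lubridate (working with dates) packages to solve the problem; stringr and rebus are used to build and apply the regular expression that extracts the date from each column name.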
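library(dplyr)
library(tidyr)
library(lubridate)
library(stringr)  # provides str_extract()
library(rebus)

df %>%
  # Reshape to long: one row per person/product/week column
  gather(DATE, SALES, -c(PERSON_ID, PRODUCT_ID)) %>%
  # Keep only the month.day.year part of each name, then parse it as a date
  mutate(DATE = str_extract(DATE, pattern = repeated(DGT, 1, 2) %R% DOT %R%
                              repeated(DGT, 1, 2) %R% DOT %R%
                              repeated(DGT, 4, 4)),
         DATE = mdy(DATE)) %>%
  group_by(PERSON_ID, PRODUCT_ID, DATE) %>%
  # funs() is deprecated in current dplyr, so sum() is called directly
  summarise(SALES = sum(SALES)) %>%
  ungroup()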
The code is written in the following manner:
1: df is converted into a long format with gather. This ensures that the rows in the data frame are observations and the columns are variables.
2: The DATE variable is cleaned with mutate to get rid of the prefix X and the suffix .1, and is then coerced to the date class (month-day-year).
3: The data is grouped by PERSON_ID, PRODUCT_ID and DATE.
4: The SALES variable is summed per group (as defined in the previous point).
If you want to convert it back to the wide format, you may add another line, i.e. %>% spread(DATE, SALES), at the end.
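The date clean-up step can also be checked in isolation. The following is a base R sketch that swaps the rebus pattern for an ordinary regular expression; the vector mangled holds the column names as read.table would produce them and is only an illustrative input.

```r
# Column names after read.table's default name repair (illustrative input)
mangled <- c("X6.23.2017", "X6.16.2017", "X6.9.2017", "X6.2.2017", "X6.2.2017.1")

# Keep only the month.day.year part, dropping the X prefix and any .1 suffix
date_part <- regmatches(mangled,
                        regexpr("[0-9]{1,2}\\.[0-9]{1,2}\\.[0-9]{4}", mangled))

# Parse as month.day.year
dates <- as.Date(date_part, format = "%m.%d.%Y")
print(dates)
```

Both X6.2.2017 and X6.2.2017.1 map to the same date, so grouping by DATE afterwards merges the duplicated week.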