
R equivalent of Excel's “Sumif(s)” function across like columns

I am fairly new to R (new to this site as well) and trying to understand how to aggregate data across columns in a situation where there is more than 1 identifier (in this case, two: PERSON_ID and PRODUCT_ID).

Please see my example below. To the right of the two identifiers within my data frame are five columns containing weekly sales figures. I need to aggregate the weekly data so that:

1: Week columns with the same name are summed (typically this is something I can easily accomplish in MS Excel using the sumif/sumifs function)

2: Any rows containing the same PERSON_ID and PRODUCT_ID combination are summed as well.

In this particular case, notice that the week of 6/2/2017 appears in more than one column. Meanwhile, PERSON_ID 0003603 appears twice for the same PRODUCT_ID, 3024.

PERSON_ID    PRODUCT_ID    6/23/2017   6/16/2017   6/9/2017   6/2/2017   6/2/2017
0003603      3024          10.000      5.000       4.000      3.000      2.000
0003603      3024          1.000       2.000       3.000      8.000      1.000     
0007654      2111          8.000       3.000       2.000      1.000      0.000
0008885      3025          0.000       0.000       1.000      3.000      9.000
0950645      3024          6.000       5.000       4.000      3.000      2.000

My actual data frame contains in excess of 1 million records, so an approach using the data.table package would be ideal, as far as I can tell.

Can someone please shed some light on how to solve this particular problem in R?

Melting your data (reshaping it to long format) is the way to go. If I understand correctly what you're after, it's simply:

library(data.table)

x = fread('PERSON_ID    PRODUCT_ID  6/23/2017   6/16/2017   6/9/2017    6/2/2017    6/2/2017
0003603 3024    10.000  5.000   4.000   3.000   2.000
0003603 3024    1.000   2.000   3.000   8.000   1.000
0007654 2111    8.000   3.000   2.000   1.000   0.000
0008885 3025    0.000   0.000   1.000   3.000   9.000
0950645 3024    6.000   5.000   4.000   3.000   2.000',
          colClasses = c('character', 'character', rep('numeric', 5L)))

xmlt = 
  melt(x, id.vars = c('PERSON_ID', 'PRODUCT_ID'),
       variable.name = 'week', value.name = 'sales')

xmlt[ , week := as.IDate(week, format = '%m/%d/%Y')]

xmlt[ , .(total_sales = sum(sales)), 
      keyby = .(PERSON_ID, PRODUCT_ID, week)]
#     PERSON_ID PRODUCT_ID       week total_sales
#  1:   0003603       3024 2017-06-02          14
#  2:   0003603       3024 2017-06-09           7
#  3:   0003603       3024 2017-06-16           7
#  4:   0003603       3024 2017-06-23          11
#  5:   0007654       2111 2017-06-02           1
#  6:   0007654       2111 2017-06-09           2
#  7:   0007654       2111 2017-06-16           3
#  8:   0007654       2111 2017-06-23           8
#  9:   0008885       3025 2017-06-02          12
# 10:   0008885       3025 2017-06-09           1
# 11:   0008885       3025 2017-06-16           0
# 12:   0008885       3025 2017-06-23           0
# 13:   0950645       3024 2017-06-02           5
# 14:   0950645       3024 2017-06-09           4
# 15:   0950645       3024 2017-06-16           5
# 16:   0950645       3024 2017-06-23           6
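If a wide layout (one column per week) is needed again after aggregating, data.table's dcast() can reshape the totals back. The following is a minimal sketch rather than part of the answer above: to avoid depending on how duplicated column names are handled, it assumes the week columns carry temporary unique names (w1 to w5) and maps them to dates through a hypothetical lookup vector, week_of.

```r
library(data.table)

# Sample data from the question. w1..w5 are temporary, unique column
# names (an assumption of this sketch); w4 and w5 are both 6/2/2017.
x <- data.table(
  PERSON_ID  = c("0003603", "0003603", "0007654", "0008885", "0950645"),
  PRODUCT_ID = c("3024", "3024", "2111", "3025", "3024"),
  w1 = c(10, 1, 8, 0, 6),
  w2 = c(5, 2, 3, 0, 5),
  w3 = c(4, 3, 2, 1, 4),
  w4 = c(3, 8, 1, 3, 3),
  w5 = c(2, 1, 0, 9, 2)
)

# Hypothetical lookup: temporary column name -> week label
week_of <- c(w1 = "6/23/2017", w2 = "6/16/2017", w3 = "6/9/2017",
             w4 = "6/2/2017",  w5 = "6/2/2017")

# Wide to long, then attach the real week as a date
long <- melt(x, id.vars = c("PERSON_ID", "PRODUCT_ID"),
             variable.name = "col", value.name = "sales")
long[, week := as.IDate(unname(week_of[as.character(col)]),
                        format = "%m/%d/%Y")]

# Sum over duplicated weeks and duplicated PERSON_ID/PRODUCT_ID rows
totals <- long[, .(total_sales = sum(sales)),
               keyby = .(PERSON_ID, PRODUCT_ID, week)]

# Long back to wide: one row per PERSON_ID/PRODUCT_ID, one column per week
wide <- dcast(totals, PERSON_ID + PRODUCT_ID ~ week,
              value.var = "total_sales")
print(wide)
```

The lookup vector is only one way to carry the week labels; with real data the mapping would come from the original header row.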

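We first define df as follows. Note that read.table(), by default (check.names = TRUE), does not keep column names that start with a number or that are duplicated. It makes them syntactically valid and unique by adding X at the front of names that start with a number, replacing invalid characters such as / with a dot, and appending .1, .2, etc. to duplicates. Here, 6/23/2017 becomes X6.23.2017 and the second 6/2/2017 becomes X6.2.2017.1.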

df <- read.table(text = "
                 PERSON_ID    PRODUCT_ID    6/23/2017   6/16/2017   6/9/2017   6/2/2017   6/2/2017
                 0003603      3024          10.000      5.000       4.000      3.000      2.000
                 0003603      3024          1.000       2.000       3.000      8.000      1.000     
                 0007654      2111          8.000       3.000       2.000      1.000      0.000
                 0008885      3025          0.000       0.000       1.000      3.000      9.000
                 0950645      3024          6.000       5.000       4.000      3.000      2.000",
                 header = TRUE, colClasses = rep(c("character", "numeric"), c(2,5)))
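The renaming can be checked directly with base R's make.names(), which is what read.table() applies to the header when check.names = TRUE. A small sketch using the headers from the sample data (the variable names raw_names and fixed_names are just for illustration):

```r
# Original headers from the sample data
raw_names <- c("PERSON_ID", "PRODUCT_ID",
               "6/23/2017", "6/16/2017", "6/9/2017", "6/2/2017", "6/2/2017")

# Make the names syntactically valid and unique
fixed_names <- make.names(raw_names, unique = TRUE)
print(fixed_names)
# "PERSON_ID" "PRODUCT_ID" "X6.23.2017" "X6.16.2017" "X6.9.2017"
# "X6.2.2017" "X6.2.2017.1"
```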

We can use the dplyr (data manipulation), tidyr (tidy data), stringr (string handling), lubridate (working with dates) and rebus (readable regular expressions) packages to solve the problem.

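library(dplyr)
library(tidyr)
library(stringr)    # provides str_extract()
library(lubridate)  # provides mdy()
library(rebus)      # provides DGT, DOT, repeated() and %R%

df %>%
  # wide to long: one row per PERSON_ID / PRODUCT_ID / week column
  gather(DATE, SALES, -c(PERSON_ID, PRODUCT_ID)) %>%
  # pull "m.d.yyyy" out of names such as "X6.2.2017.1", then parse as a date
  mutate(DATE = str_extract(DATE, pattern = repeated(DGT, 1, 2) %R% DOT %R%
                                            repeated(DGT, 1, 2) %R% DOT %R%
                                            repeated(DGT, 4, 4)),
         DATE = mdy(DATE)) %>%
  group_by(PERSON_ID, PRODUCT_ID, DATE) %>%
  summarise(SALES = sum(SALES)) %>%
  ungroup()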

The code works as follows:

  1. The wide df is converted into long format, so that the rows of the data frame are observations and the columns are variables.
  2. The DATE variable is cleaned to get rid of the prefix X and the suffix .1, then converted to a date (parsed as month-day-year).
  3. The data frame is grouped by three variables, i.e. PERSON_ID, PRODUCT_ID and DATE.
  4. The SALES variable is summed within each group (as defined in the previous step).

If you want to convert the result back to wide format, you can append one more step, i.e. %>% spread(DATE, SALES), to the end of the pipeline.
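For readers on newer tidyr releases, the same steps can also be sketched with pivot_longer() and pivot_wider(), the successors of gather() and spread(). This is an alternative sketch, not the code from the answer: it assumes tidyr 1.0 or later and dplyr 1.0 or later (for the .groups argument), builds df by hand with the names read.table() would produce, and swaps the rebus pattern for a plain base R regular expression.

```r
library(dplyr)
library(tidyr)

# Sample data with the column names read.table() generates by default
df <- data.frame(
  PERSON_ID   = c("0003603", "0003603", "0007654", "0008885", "0950645"),
  PRODUCT_ID  = c("3024", "3024", "2111", "3025", "3024"),
  X6.23.2017  = c(10, 1, 8, 0, 6),
  X6.16.2017  = c(5, 2, 3, 0, 5),
  X6.9.2017   = c(4, 3, 2, 1, 4),
  X6.2.2017   = c(3, 8, 1, 3, 3),
  X6.2.2017.1 = c(2, 1, 0, 9, 2),
  stringsAsFactors = FALSE
)

long <- df %>%
  # wide to long
  pivot_longer(-c(PERSON_ID, PRODUCT_ID),
               names_to = "DATE", values_to = "SALES") %>%
  # strip the X prefix and any .1 suffix, then parse month.day.year
  mutate(DATE = sub("^X([0-9]{1,2}\\.[0-9]{1,2}\\.[0-9]{4}).*$", "\\1", DATE),
         DATE = as.Date(DATE, format = "%m.%d.%Y")) %>%
  group_by(PERSON_ID, PRODUCT_ID, DATE) %>%
  summarise(SALES = sum(SALES), .groups = "drop")

# long back to wide: one column per week
wide <- long %>%
  pivot_wider(names_from = DATE, values_from = SALES)

print(wide)
```

Dropping the groups inside summarise() replaces the separate ungroup() step; the result should otherwise match the gather()/spread() version.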
