
Efficient way to mutate rowwise

I have two data tables, dfUsers and purchases, generated with the code below:

set.seed(1)
library(data.table)

dfUsers <- data.table(user = letters[1:5],
                      startDate = sample(seq.Date(from = as.Date('2016-01-01'), to = Sys.Date(), by = '1 day'), 3)
                      )

dfUsers$endDate <- dfUsers$startDate + sample(30:90,1)

purchases <- data.table(
  user = sample(letters[1:5], 500, replace = TRUE),
  purchaseDate = sample(seq.Date(from = as.Date('2016-01-01'), to = Sys.Date(), by = '1 day'), 500, replace = TRUE),
  amount = runif(50,300, 500)
)

For each user I want to add together all the purchases during the period between the startDate and endDate.

My current approach is a rowwise dplyr mutate() that calls a function, but it becomes terribly slow as both tables grow.

I'm learning R, so I'm wondering whether there is a more efficient way to approach a problem of this nature.

The function:

addPurchases <- function(u, startDate, endDate) {
  purchases[user == u & startDate <= purchaseDate & endDate >= purchaseDate, sum(amount)]
}

The dplyr chain:

library(dplyr)
dfUsers %>% 
  rowwise() %>%
  mutate(totalPurchase = addPurchases(user, startDate, endDate))

A fast, clean, and memory-efficient solution is a data.table non-equi join:

purchases[dfUsers, on = .(user, purchaseDate >= startDate, purchaseDate <= endDate),
          sum(amount), by = .EACHI]
#   user purchaseDate purchaseDate       V1
#1:    a   2016-07-06   2016-09-29 6929.469
#2:    b   2016-09-20   2016-12-14 6563.416
#3:    c   2017-02-08   2017-05-04 3607.794
#4:    d   2016-07-06   2016-09-29 5591.748
#5:    e   2016-09-20   2016-12-14 5727.622
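Because the data above is random (and depends on Sys.Date()), the join is easier to verify on a tiny hand-made example. The sketch below uses hypothetical tables `users` and `buys`, which are not part of the question, and names the result column `total` instead of the default V1. It assumes the date bounds are inclusive, as in the question.

```r
library(data.table)

# Hypothetical toy data, small enough to check by hand
users <- data.table(
  user      = c("a", "b"),
  startDate = as.Date(c("2016-01-10", "2016-02-01")),
  endDate   = as.Date(c("2016-01-20", "2016-02-10"))
)
buys <- data.table(
  user         = c("a", "a", "a", "b"),
  purchaseDate = as.Date(c("2016-01-09", "2016-01-10", "2016-01-20", "2016-03-01")),
  amount       = c(1, 10, 100, 1000)
)

# For each row of `users`, find the rows of `buys` with the same user and a
# purchaseDate inside [startDate, endDate]; by = .EACHI evaluates j once per
# row of `users`. na.rm = TRUE turns the NA of an unmatched user into 0.
res <- buys[users,
            on = .(user, purchaseDate >= startDate, purchaseDate <= endDate),
            .(total = sum(amount, na.rm = TRUE)),
            by = .EACHI]
res
```

Here user a has two purchases inside the window (10 and 100), and user b has none, so b still appears in the result with a total of 0 rather than being dropped.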

A solution using dplyr. The idea is to join the data frames by user, filter the rows by date, and then sum the amounts by user.

library(dplyr)
dfUsers2 <- dfUsers %>%
  full_join(purchases, by = "user") %>%
  filter(purchaseDate >= startDate, purchaseDate <= endDate) %>%
  group_by(user) %>%
  summarise(Total = sum(amount, na.rm = TRUE))
dfUsers2
# # A tibble: 5 x 2
#    user    Total
#   <chr>    <dbl>
# 1     a 6929.469
# 2     b 6563.416
# 3     c 3607.794
# 4     d 5591.748
# 5     e 5727.622
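One thing to watch with the join-then-filter pattern is that a user with no purchases inside their window disappears after filter(). A minimal sketch of one way to keep such users, assuming a total of 0 is the desired result for them; the tables `users` and `buys` are hypothetical toy data, not the question's data.

```r
library(dplyr)

# Hypothetical toy data, small enough to check by hand
users <- data.frame(
  user      = c("a", "b"),
  startDate = as.Date(c("2016-01-10", "2016-02-01")),
  endDate   = as.Date(c("2016-01-20", "2016-02-10"))
)
buys <- data.frame(
  user         = c("a", "a", "a", "b"),
  purchaseDate = as.Date(c("2016-01-09", "2016-01-10", "2016-01-20", "2016-03-01")),
  amount       = c(1, 10, 100, 1000)
)

# Join, keep only purchases inside each user's window, then sum per user
totals <- users %>%
  inner_join(buys, by = "user") %>%
  filter(purchaseDate >= startDate, purchaseDate <= endDate) %>%
  group_by(user) %>%
  summarise(Total = sum(amount), .groups = "drop")

# Join the totals back so users without qualifying purchases get 0
result <- users %>%
  left_join(totals, by = "user") %>%
  mutate(Total = coalesce(Total, 0))
result
```

In this toy data user b's only purchase falls outside the window, so `totals` contains only user a, while `result` lists both users.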

A solution using data.table: merge the two tables, filter by date, and calculate the sum by user:

library(data.table)
# Using the OP's data
merge(dfUsers, 
      purchases, 
      "user")[purchaseDate >= startDate & purchaseDate <= endDate, 
              sum(amount), 
              user]
#    user       V1
# 1:    a 6929.469
# 2:    b 6563.416
# 3:    c 3607.794
# 4:    d 5591.748
# 5:    e 5727.622
