簡體   English   中英

使用 dplyr 代替 lapply

[英]Using dplyr instead of lapply

我有一個包含一堆startend日期的dataframe ,我正在遍歷一個日期列表,並查看在列表中的那個日期期間我的數據框中有多少行是“打開的”(即開始日期已經發生但結束日期沒有)。

我curently這樣使用lapply但我想知道,如果它可以在完成dplyr來代替,而如果在內存和速度(實際數據幀是150萬行)方面的任何益處。

      RollingDateRange <- seq(Sys.Date()-15, Sys.Date(), by="days")
      temp <- data.frame(RollingDateRange)

      dat <- data.frame(
        Order = c(1,1,1,2,2,2,3,3,3), 
        Code = c("Green","Yellow","Blue","Yellow","Yellow","Red","Purple","Green","Blue"),
        Start.Date = as.Date(c("2020-02-01","2020-02-02","2020-02-03","2020-02-01","2020-02-02","2020-02-03","2020-02-01","2020-02-02","2020-02-03")),
        End.Date = as.Date(c("2020-02-02","2020-02-08",NA,"2020-02-07","2020-02-06",NA,"2020-02-03","2020-02-08","2020-02-06")),
        Count = c(1,1,1,1,1,1,1,1,1),
        stringsAsFactors = FALSE)

      temp$Count <- lapply(temp$RollingDateRange, function(d){
        b <- dat[((dat$Start.Date <= d) & (dat$End.Date >= d)) | ((dat$Start.Date <= d) & (is.na(dat$End.Date))),]

        total <- sum(b$Count, na.rm = TRUE)
      })

輸出:

> temp
   RollingDateRange Count
1        2020-01-25     0
2        2020-01-26     0
3        2020-01-27     0
4        2020-01-28     0
5        2020-01-29     0
6        2020-01-30     0
7        2020-01-31     0
8        2020-02-01     3
9        2020-02-02     6
10       2020-02-03     8
11       2020-02-04     7
12       2020-02-05     7
13       2020-02-06     7
14       2020-02-07     5
15       2020-02-08     4
16       2020-02-09     2

考慮帶有矢量索引的vapply ,這可能會減少lapply處理。 具體地,與lapply返回一個列表, sapply ,默認情況下返回的載體, vapply (類似於sapply )返回所定義的類型和長度的特定載體:

temp$Count <- vapply(temp$RollingDateRange, function(d){
   # LOGICAL INDEXING OF VECTOR (I.E., ONLY "COUNT" COLUMN)
   b <- with(dat, dat$Count[((Start.Date <= d) & (End.Date >= d)) | 
                            ((Start.Date <= d) & (is.na(End.Date)))])

   total <- sum(b, na.rm = TRUE)
}, numeric(1))

您的簡單示例顯示了時間上的顯着差異:

system.time( {
    temp$Count <- lapply(temp$RollingDateRange, function(d){
        # LOGICAL INDEXING OF DATA FRAME RETURNING ALL COLUMNS
        b <- dat[((dat$Start.Date <= d) & (dat$End.Date >= d)) | 
                 ((dat$Start.Date <= d) & (is.na(dat$End.Date))),]

        total <- sum(b$Count, na.rm = TRUE)
    })

})

#    user  system elapsed 
#   0.003   0.000   0.005 

system.time( {
    temp$Count <- vapply(temp$RollingDateRange, function(d){
        # LOGICAL INDEXING OF VECTOR (I.E., ONLY "COUNT" COLUMN)
        b <- with(dat, dat$Count[((Start.Date <= d) & (End.Date >= d)) | 
                                 ((Start.Date <= d) & (is.na(End.Date)))])

        total <- sum(b, na.rm = TRUE)
    }, numeric(1))
})

#    user  system elapsed 
#   0.001   0.000   0.001 

比較其他建議的解決方案,這些解決方案可能因機器和軟件包版本而異。

# @akrun's SOLUTION
system.time( {
  temp %>% 
    pull(RollingDateRange) %>%
    map_dfr(~ 
              dat %>%
              filter((Start.Date <= .x & End.Date >= .x)|
                     (Start.Date <= .x & is.na(End.Date))) %>% 
              pull(Count) %>% 
              sum %>% 
              tibble(RollingDateRange = .x, Count = .))
})

#    user  system elapsed 
#   0.029   0.000   0.029 


# @RonakShah's SOLUTION
system.time({
  temp %>%
    mutate(Count = purrr::map_dbl(RollingDateRange, ~ with(dat, 
                 sum(Count[(Start.Date <= .x & End.Date >= .x) | 
                           (Start.Date <= .x & is.na(End.Date))], na.rm = TRUE))))

})

#    user  system elapsed 
#   0.002   0.000   0.001 

我們可以使用map_dblpurrr來計算滿足條件的Count值的總和。

library(dplyr)

temp %>%
  mutate(Count = purrr::map_dbl(RollingDateRange, ~ with(dat, 
                 sum(Count[(Start.Date <= .x & End.Date >= .x) | 
                           (Start.Date <= .x & is.na(End.Date))], na.rm = TRUE))))

#   RollingDateRange Count
#1        2020-01-25     0
#2        2020-01-26     0
#3        2020-01-27     0
#4        2020-01-28     0
#5        2020-01-29     0
#6        2020-01-30     0
#7        2020-01-31     0
#8        2020-02-01     3
#9        2020-02-02     6
#10       2020-02-03     8
#11       2020-02-04     7
#12       2020-02-05     7
#13       2020-02-06     7
#14       2020-02-07     5
#15       2020-02-08     4
#16       2020-02-09     2

如果我們想要 tidyverse 方法,請使用map

library(dplyr)
library(purrr)
temp %>% 
    pull(RollingDateRange) %>%
    map_dfr(~ 
          dat %>%
              filter((Start.Date <= .x & End.Date >= .x)|
               (Start.Date <= .x & is.na(End.Date))) %>% 
              pull(Count) %>% 
              sum %>% 
              tibble(RollingDateRange = .x, Count = .))
# A tibble: 16 x 2
#   RollingDateRange Count
#   <date>           <dbl>
# 1 2020-01-25           0
# 2 2020-01-26           0
# 3 2020-01-27           0
# 4 2020-01-28           0
# 5 2020-01-29           0
# 6 2020-01-30           0
# 7 2020-01-31           0
# 8 2020-02-01           3
# 9 2020-02-02           6
#10 2020-02-03           8
#11 2020-02-04           7
#12 2020-02-05           7
#13 2020-02-06           7
#14 2020-02-07           5
#15 2020-02-08           4
#16 2020-02-09           2

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM