[英]Using dplyr instead of lapply
我有一個包含一堆start
和end
日期的dataframe
,我正在遍歷一個日期列表,並查看在列表中的那個日期期間我的數據框中有多少行是“打開的”(即開始日期已經發生但結束日期沒有)。
我curently這樣使用lapply
但我想知道,如果它可以在完成dplyr
來代替,而如果在內存和速度(實際數據幀是150萬行)方面的任何益處。
RollingDateRange <- seq(Sys.Date()-15, Sys.Date(), by="days")
temp <- data.frame(RollingDateRange)
dat <- data.frame(
Order = c(1,1,1,2,2,2,3,3,3),
Code = c("Green","Yellow","Blue","Yellow","Yellow","Red","Purple","Green","Blue"),
Start.Date = as.Date(c("2020-02-01","2020-02-02","2020-02-03","2020-02-01","2020-02-02","2020-02-03","2020-02-01","2020-02-02","2020-02-03")),
End.Date = as.Date(c("2020-02-02","2020-02-08",NA,"2020-02-07","2020-02-06",NA,"2020-02-03","2020-02-08","2020-02-06")),
Count = c(1,1,1,1,1,1,1,1,1),
stringsAsFactors = FALSE)
temp$Count <- lapply(temp$RollingDateRange, function(d){
b <- dat[((dat$Start.Date <= d) & (dat$End.Date >= d)) | ((dat$Start.Date <= d) & (is.na(dat$End.Date))),]
total <- sum(b$Count, na.rm = TRUE)
})
輸出:
> temp
RollingDateRange Count
1 2020-01-25 0
2 2020-01-26 0
3 2020-01-27 0
4 2020-01-28 0
5 2020-01-29 0
6 2020-01-30 0
7 2020-01-31 0
8 2020-02-01 3
9 2020-02-02 6
10 2020-02-03 8
11 2020-02-04 7
12 2020-02-05 7
13 2020-02-06 7
14 2020-02-07 5
15 2020-02-08 4
16 2020-02-09 2
考慮帶有矢量索引的vapply
,這可能會減少lapply
處理。 具體地,與lapply
返回一個列表, sapply
,默認情況下返回的載體, vapply
(類似於sapply
)返回所定義的類型和長度的特定載體:
temp$Count <- vapply(temp$RollingDateRange, function(d){
# LOGICAL INDEXING OF VECTOR (I.E., ONLY "COUNT" COLUMN)
b <- with(dat, dat$Count[((Start.Date <= d) & (End.Date >= d)) |
((Start.Date <= d) & (is.na(End.Date)))])
total <- sum(b, na.rm = TRUE)
}, numeric(1))
您的簡單示例顯示了時間上的顯着差異:
system.time( {
temp$Count <- lapply(temp$RollingDateRange, function(d){
# LOGICAL INDEXING OF DATA FRAME RETURNING ALL COLUMNS
b <- dat[((dat$Start.Date <= d) & (dat$End.Date >= d)) |
((dat$Start.Date <= d) & (is.na(dat$End.Date))),]
total <- sum(b$Count, na.rm = TRUE)
})
})
# user system elapsed
# 0.003 0.000 0.005
system.time( {
temp$Count <- vapply(temp$RollingDateRange, function(d){
# LOGICAL INDEXING OF VECTOR (I.E., ONLY "COUNT" COLUMN)
b <- with(dat, dat$Count[((Start.Date <= d) & (End.Date >= d)) |
((Start.Date <= d) & (is.na(End.Date)))])
total <- sum(b, na.rm = TRUE)
}, numeric(1))
})
# user system elapsed
# 0.001 0.000 0.001
比較其他建議的解決方案,這些解決方案可能因機器和軟件包版本而異。
# @akrun's SOLUTION
system.time( {
temp %>%
pull(RollingDateRange) %>%
map_dfr(~
dat %>%
filter((Start.Date <= .x & End.Date >= .x)|
(Start.Date <= .x & is.na(End.Date))) %>%
pull(Count) %>%
sum %>%
tibble(RollingDateRange = .x, Count = .))
})
# user system elapsed
# 0.029 0.000 0.029
# @RonakShah's SOLUTION
system.time({
temp %>%
mutate(Count = purrr::map_dbl(RollingDateRange, ~ with(dat,
sum(Count[(Start.Date <= .x & End.Date >= .x) |
(Start.Date <= .x & is.na(End.Date))], na.rm = TRUE))))
})
# user system elapsed
# 0.002 0.000 0.001
我們可以使用map_dbl
的purrr
來計算滿足條件的Count
值的總和。
library(dplyr)
temp %>%
mutate(Count = purrr::map_dbl(RollingDateRange, ~ with(dat,
sum(Count[(Start.Date <= .x & End.Date >= .x) |
(Start.Date <= .x & is.na(End.Date))], na.rm = TRUE))))
# RollingDateRange Count
#1 2020-01-25 0
#2 2020-01-26 0
#3 2020-01-27 0
#4 2020-01-28 0
#5 2020-01-29 0
#6 2020-01-30 0
#7 2020-01-31 0
#8 2020-02-01 3
#9 2020-02-02 6
#10 2020-02-03 8
#11 2020-02-04 7
#12 2020-02-05 7
#13 2020-02-06 7
#14 2020-02-07 5
#15 2020-02-08 4
#16 2020-02-09 2
如果我們想要 tidyverse 方法,請使用map
library(dplyr)
library(purrr)
temp %>%
pull(RollingDateRange) %>%
map_dfr(~
dat %>%
filter((Start.Date <= .x & End.Date >= .x)|
(Start.Date <= .x & is.na(End.Date))) %>%
pull(Count) %>%
sum %>%
tibble(RollingDateRange = .x, Count = .))
# A tibble: 16 x 2
# RollingDateRange Count
# <date> <dbl>
# 1 2020-01-25 0
# 2 2020-01-26 0
# 3 2020-01-27 0
# 4 2020-01-28 0
# 5 2020-01-29 0
# 6 2020-01-30 0
# 7 2020-01-31 0
# 8 2020-02-01 3
# 9 2020-02-02 6
#10 2020-02-03 8
#11 2020-02-04 7
#12 2020-02-05 7
#13 2020-02-06 7
#14 2020-02-07 5
#15 2020-02-08 4
#16 2020-02-09 2
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.