Determine overlapping ranges - R

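I have two data frames: one with purchases made within a month, and one with the advertisements (ads) broadcast in that month. To find out whether purchases can be reliably linked to an ad, I want to know how many purchase dates fall within 4 days after an ad. I have written some (cumbersome) code to do this. It expands every row of the ad data frame to cover the relevant 4-day window, then uses a merge to see where there is (or is not) an overlap. This feels like a very clumsy way of doing things. Ideally I would like to do this in dplyr in an elegant way. If anyone has a suggestion, please let me know.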

library(dplyr)
library(lubridate)
library(data.table)

# set start and end dates to sample between
day.start <- "2007/01/01"
day.end <- "2007/01/30"

set.seed(1) # fix the seed so the sampled dates are reproducible
# define a random date selection function
rand.day.time <- function(day.start,day.end,size) {
  dayseq <- seq.Date(as.Date(day.start),as.Date(day.end),by="day")
  dayselect <- sample(dayseq,size,replace=TRUE)
  as.POSIXlt(paste(dayselect) )
}

dateval=rand.day.time(day.start,day.end,size=20)

###create initial dataframes
action=rep(c("ad","purchase"),10)
id=rep(c(1,1,2,2),5)
df=data.frame(customer=id,date=dateval,action=action)
df_pur=filter(df,action=="purchase");(df_pur=df_pur[order(df_pur$date),])
df_ad=filter(df,action=="ad");(df_ad=df_ad[order(df_ad$date),])

#expand data-frame to include all the ranges for which the ad might trigger purchases
df_ad_exp = df_ad %>%
  group_by(customer,date) %>%
  summarize(start=min(date),end=min(date+days(4))) 
df_ad_exp=as.data.frame(df_ad_exp)
df_ad_exp2=setDT(df_ad_exp)[, list(customer=customer, range=seq(start,end,by="day")), by=1:nrow(df_ad_exp)]

###merge the dataframe, use NA values to identify those dates in which purchase was made but no ad was "active"
df_ad_exp2=as.data.frame(df_ad_exp2)
(df_ad_exp2=df_ad_exp2[,c("customer","range")])
df_ad_exp2$helpercol=0
(df_pur_m=merge(df_pur,df_ad_exp2,by.x=c("date","customer"),by.y=c("range","customer"),all.x=TRUE))

df_pur_m$ad_in_range=df_pur_m$helpercol;df_pur_m$helpercol=NULL
df_pur_m$ad_in_range[!is.na(df_pur_m$ad_in_range)]=1;df_pur_m$ad_in_range[is.na(df_pur_m$ad_in_range)]=0

#outcomes
df_pur
df_ad
df_pur_m

> df_ad
   customer       date action
3         1 2007-01-07     ad
6         2 2007-01-07     ad
1         1 2007-01-08     ad
10        2 2007-01-12     ad
2         2 2007-01-18     ad
5         1 2007-01-19     ad
7         1 2007-01-21     ad
9         1 2007-01-22     ad
8         2 2007-01-24     ad
4         2 2007-01-29     ad
> df_pur_m
         date customer   action ad_in_range
1  2007-01-02        1 purchase           0
2  2007-01-06        2 purchase           0
3  2007-01-12        1 purchase           1
4  2007-01-12        1 purchase           1
5  2007-01-15        2 purchase           1
6  2007-01-20        2 purchase           1
7  2007-01-24        2 purchase           1
8  2007-01-27        1 purchase           0
9  2007-01-28        2 purchase           1
10 2007-01-30        1 purchase           0

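Try foverlaps in data.table, which is designed for exactly this (I can't think of an elegant dplyr way, sorry). You need a start and an end date column in both tables: for the ads, the interval runs from the ad date to 4 days later; for the purchases, the start and end dates are the same.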

# df_ad must be keyed
setDT(df_ad)[, ad_date_end:=date + days(4)]
setnames(df_ad, 'date', 'ad_date') # just for readability later
setkey(df_ad, customer, ad_date, ad_date_end)

setDT(df_pur)[, purch_end:=date]
setnames(df_pur, 'date', 'purch_date') # for readability

# type='within': the x interval (purchase) is within the y interval (ad)
# we merge on customer ID, start & end date
ovl <- foverlaps(df_pur, df_ad,
                 by.x=c('customer', 'purch_date', 'purch_end'), type='within') 

#     customer    ad_date action ad_date_end purch_date i.action  purch_end
#  1:        1       <NA>     NA        <NA> 2007-01-02 purchase 2007-01-02
#  2:        2       <NA>     NA        <NA> 2007-01-06 purchase 2007-01-06
#  3:        1 2007-01-08     ad  2007-01-12 2007-01-12 purchase 2007-01-12
#  4:        1 2007-01-08     ad  2007-01-12 2007-01-12 purchase 2007-01-12
#  5:        2 2007-01-12     ad  2007-01-16 2007-01-15 purchase 2007-01-15
#  6:        2 2007-01-18     ad  2007-01-22 2007-01-20 purchase 2007-01-20
#  7:        2 2007-01-24     ad  2007-01-28 2007-01-24 purchase 2007-01-24
#  8:        1       <NA>     NA        <NA> 2007-01-27 purchase 2007-01-27
#  9:        2 2007-01-24     ad  2007-01-28 2007-01-28 purchase 2007-01-28
# 10:        1       <NA>     NA        <NA> 2007-01-30 purchase 2007-01-30
# tidy up
ovl[, action:=i.action][, c('ad_date_end', 'purch_end', 'i.action'):=NULL]
#     customer    ad_date   action purch_date
#  1:        1       <NA> purchase 2007-01-02
#  2:        2       <NA> purchase 2007-01-06
#  3:        1 2007-01-08 purchase 2007-01-12
#  4:        1 2007-01-08 purchase 2007-01-12
#  5:        2 2007-01-12 purchase 2007-01-15
#  6:        2 2007-01-18 purchase 2007-01-20
#  7:        2 2007-01-24 purchase 2007-01-24
#  8:        1       <NA> purchase 2007-01-27
#  9:        2 2007-01-24 purchase 2007-01-28
# 10:        1       <NA> purchase 2007-01-30
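
The question asked for a 0/1 flag per purchase and a count, so the joined table can be reduced one step further. The following is a minimal sketch rather than part of the original answer: it hard-codes the ads and purchases printed in the question instead of re-sampling them, uses plain `Date` columns, and the names `ads`, `pur` and `res` are illustrative. It also passes `mult = "first"`, on the assumption that a purchase falling inside several ad windows should still be counted once.

```r
library(data.table)

# Fixed copies of the ads and purchases printed in the question (no sampling).
ads <- data.table(
  customer = c(1, 2, 1, 2, 2, 1, 1, 1, 2, 2),
  ad_date  = as.Date(c("2007-01-07", "2007-01-07", "2007-01-08", "2007-01-12",
                       "2007-01-18", "2007-01-19", "2007-01-21", "2007-01-22",
                       "2007-01-24", "2007-01-29"))
)
pur <- data.table(
  customer   = c(1, 2, 1, 1, 2, 2, 2, 1, 2, 1),
  purch_date = as.Date(c("2007-01-02", "2007-01-06", "2007-01-12", "2007-01-12",
                         "2007-01-15", "2007-01-20", "2007-01-24", "2007-01-27",
                         "2007-01-28", "2007-01-30"))
)

# Ad interval: the ad date up to and including 4 days later.
ads[, ad_date_end := ad_date + 4]
setkey(ads, customer, ad_date, ad_date_end)

# Purchase interval: a single day, so start and end coincide.
pur[, purch_end := purch_date]

# mult = "first" keeps one row per purchase even if several ad windows match.
res <- foverlaps(pur, ads,
                 by.x = c("customer", "purch_date", "purch_end"),
                 type = "within", mult = "first")

# 1 if some ad window contains the purchase date, 0 otherwise.
res[, ad_in_range := as.integer(!is.na(ad_date))]

res[, .(customer, purch_date, ad_in_range)]
sum(res$ad_in_range)  # number of purchases within 4 days of an ad
```

On this data the flag should agree with the `ad_in_range` column of `df_pur_m` in the question.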

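Rows where ad_date is NA are purchases that did not fall within 4 days of an ad; rows with an ad_date are purchases that fell within the window of the ad broadcast on that date.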
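
For readers who still prefer dplyr: releases from 1.1.0 onward provide `join_by()` with a `between()` helper for this kind of interval match. The following is a hedged sketch under that version assumption, not something the original answer proposed. It reuses the dates printed in the question as fixed data, and the names `ads`, `pur`, `ad_end`, `purchase_id` and `flagged` are made up for illustration.

```r
library(dplyr)  # assumes dplyr >= 1.1.0 for join_by()

# Ads with their 4-day window (ad date up to and including 4 days later).
ads <- tibble(
  customer = c(1, 2, 1, 2, 2, 1, 1, 1, 2, 2),
  ad_date  = as.Date(c("2007-01-07", "2007-01-07", "2007-01-08", "2007-01-12",
                       "2007-01-18", "2007-01-19", "2007-01-21", "2007-01-22",
                       "2007-01-24", "2007-01-29"))
) %>%
  mutate(ad_end = ad_date + 4)

# Purchases, with a row id so that repeated purchases stay distinct.
pur <- tibble(
  customer   = c(1, 2, 1, 1, 2, 2, 2, 1, 2, 1),
  purch_date = as.Date(c("2007-01-02", "2007-01-06", "2007-01-12", "2007-01-12",
                         "2007-01-15", "2007-01-20", "2007-01-24", "2007-01-27",
                         "2007-01-28", "2007-01-30"))
) %>%
  mutate(purchase_id = row_number())

flagged <- pur %>%
  # Match on customer and on purch_date lying in [ad_date, ad_end];
  # purchases without a matching ad are kept with NA ad columns.
  left_join(ads, by = join_by(customer, between(purch_date, ad_date, ad_end))) %>%
  # Collapse back to one row per purchase in case several ads match.
  group_by(purchase_id, customer, purch_date) %>%
  summarise(ad_in_range = as.integer(any(!is.na(ad_date))), .groups = "drop")

flagged
sum(flagged$ad_in_range)  # number of purchases within 4 days of an ad
```

The grouping step plays the same role as collapsing multiple matches in the data.table version: a purchase covered by two overlapping ad windows is still counted only once.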
