
Changing a df column value in a row based on other columns' relationship to other rows in df

Working in R 3.1.1.

I have a dataset with transaction data. Each customer has bought at least twice (I've subsetted my original data). What I would like to do is flag each transaction as a "first time buyer" transaction or a "repeat buyer" transaction. The issue is that I would like to define a "repeat buyer transaction" as one within a certain time frame of a past transaction, so it's not quite as simple as flagging the first one ever for each customer as "first" and the rest as "repeat". If a customer hasn't bought in more than 1 year (52.25 weeks), I want him/her to be counted as first time!

The best way of accomplishing this that I've been able to come up with is extremely inefficient, I think (full disclosure: it's still running, so it may be erroneous to boot). I'm using nested for loops... :(

Any suggestions on how to accomplish this more efficiently? Thanks in advance for your help and suggestions! Code is commented throughout so I'll let it speak for itself, but please do let me know if it's not clear!

#let's ensure the repdata is ordered by date first
repdata <- repdata[order(repdata$date), ]

#now, we loop through repdata and decide whether purchase
#is a first time or repeat buyer

#setting time frame to 1 year (52.25 weeks as we use week as units below)
timeframe <- 52.25

#add new column to repdata that we will use below
repdata$rpt52wk <- ""

#for each row in repdata, do the following
for (i in seq_along(repdata$date)) {
  #assume that this is a first purchase; set rpt52wk var for [i] to "FIRST TIME BUYER"
  repdata$rpt52wk[i] <- "FIRST TIME BUYER"

  #look at all previous transactions
  #we can ignore higher indexed transactions (we sorted the data, ascending by date)
  for (j in seq_along(repdata$date[1:(i-1)])) {
    #if a transaction is found in which the same member bought within the timeframe
    if (repdata$MEMBER_ID[i] == repdata$MEMBER_ID[j] &&
        difftime(repdata$date[i], repdata$date[j], units = "weeks") < timeframe) {
      #then this is a repeat buyer; set rpt var for [i] appropriately
      repdata$rpt52wk[i] <- "REPEAT BUYER"
    }
  }
}

Adding test data that fails, at least when run on my side with the two solutions presented so far.

MEMBER_ID       date
      1 2011-04-13
      2 2011-04-22
      3 2011-04-17
      3 2011-04-26
      4 2011-04-13
      4 2011-04-16
      4 2011-04-16
      5 2011-04-20
      5 2011-04-13
      5 2011-04-18
      6 2011-04-13
      7 2011-04-13
      8 2011-04-25
      8 2011-04-20
      9 2011-04-14
     10 2011-04-14
     11 2011-04-18
     12 2011-04-15
     13 2011-04-15
     14 2011-04-13

#TEST SET GENERATION:
library(lubridate)
MEMBER_ID <- c(1,2,3,3,4,4,4,5,5,5,6,7,8,8,9,10,11,12,13,14)
date <- ymd(c("2011-04-13 UTC", "2011-04-22 UTC", "2011-04-17 UTC", "2011-04-26 UTC", 
          "2011-04-13 UTC", "2011-04-16 UTC", "2011-04-16 UTC", "2011-04-20 UTC", 
          "2011-04-13 UTC", "2011-04-18 UTC", "2011-04-13 UTC", "2011-04-13 UTC", 
          "2011-04-25 UTC", "2011-04-20 UTC", "2011-04-14 UTC", "2011-04-14 UTC", 
          "2011-04-18 UTC", "2011-04-15 UTC", "2011-04-15 UTC", "2011-04-13 UTC"))
rm(repdata)
repdata <- data.frame(MEMBER_ID, date)
repdata

(Note that I realize that the code has a bug for i=1. I'm going to ignore it for now in favour of not adding another if statement inside my for loop.)
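For reference, the i=1 issue can be reproduced in isolation. The sketch below assumes a small made-up date vector `x`; it shows that `1:(i-1)` becomes `c(1, 0)` when `i` is 1, so the inner loop still runs once, whereas `seq_len(i - 1)` gives an empty sequence and would skip the loop without an extra if statement.

```r
# x is a made-up stand-in for repdata$date
i <- 1
x <- as.Date(c("2011-04-13", "2011-04-16", "2011-04-20"))

# 1:(i - 1) is c(1, 0) when i is 1, and x[c(1, 0)] still holds one element,
# so the inner loop would compare row 1 with itself
buggy <- seq_along(x[1:(i - 1)])

# seq_len(0) is an empty vector, so a for loop over it does not run at all
safe <- seq_len(i - 1)

print(buggy)
print(safe)
```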

You could give it a try using ddply.

First generate a dataset sorted by date, with a timeframe of 52 weeks.

#TEST SET GENERATION:
library(lubridate)
MEMBER_ID <- c(1,2,3,3,4,4,4,5,5,5,6,7,8,8,9,10,11,12,13,14)
date <- ymd(c("2011-04-13 UTC", "2011-04-22 UTC", "2011-04-17 UTC", "2011-04-26 UTC", 
          "2011-04-13 UTC", "2011-04-16 UTC", "2011-04-16 UTC", "2011-04-20 UTC", 
          "2011-04-13 UTC", "2011-04-18 UTC", "2011-04-13 UTC", "2011-04-13 UTC", 
          "2011-04-25 UTC", "2011-04-20 UTC", "2011-04-14 UTC", "2011-04-14 UTC", 
          "2011-04-18 UTC", "2011-04-15 UTC", "2011-04-15 UTC", "2011-04-13 UTC"))
rm(repdata)
repdata <- data.frame(MEMBER_ID, date)
repdata <- repdata[order(repdata$date),]
repdata

# define a timeframe of 52 weeks
timeframe <- as.difftime(52, units = "weeks")

Then run the following code:

library(plyr)

first.buyers <- ddply(repdata, .(MEMBER_ID),
                  function(x) x[c(TRUE, diff(x$date) > timeframe),])
first.buyers <- mutate(first.buyers, rpt52wk = "FIRST TIME BUYER")

final <- merge(repdata,first.buyers, all = TRUE)
final[is.na(final$rpt52wk),"rpt52wk"] <- "REPEAT BUYER"

We get the following result:

   MEMBER_ID       date          rpt52wk
1          1 2011-04-13 FIRST TIME BUYER
2          2 2011-04-22 FIRST TIME BUYER
3          3 2011-04-17 FIRST TIME BUYER
4          3 2011-04-26     REPEAT BUYER
5          4 2011-04-13 FIRST TIME BUYER
6          4 2011-04-16     REPEAT BUYER
7          4 2011-04-16     REPEAT BUYER
8          5 2011-04-13 FIRST TIME BUYER
9          5 2011-04-18     REPEAT BUYER
10         5 2011-04-20     REPEAT BUYER
11         6 2011-04-13 FIRST TIME BUYER
12         7 2011-04-13 FIRST TIME BUYER
13         8 2011-04-20 FIRST TIME BUYER
14         8 2011-04-25     REPEAT BUYER
15         9 2011-04-14 FIRST TIME BUYER
16        10 2011-04-14 FIRST TIME BUYER
17        11 2011-04-18 FIRST TIME BUYER
18        12 2011-04-15 FIRST TIME BUYER
19        13 2011-04-15 FIRST TIME BUYER
20        14 2011-04-13 FIRST TIME BUYER

ddply splits your dataframe by MEMBER_ID and applies a function to each subset. Each subset is a dataframe with a fixed MEMBER_ID and dates in ascending order. The first element always corresponds to a first buyer. For the following elements, you have to determine whether the time elapsed since the previous transaction is larger than your threshold (if so, this member can again be considered a first buyer).

In the code above you should check that time units are consistent when doing the comparison diff(x$date) > timeframe (this depends on your date format).
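One way to check the units is sketched below, assuming a made-up vector `d` of class Date. As far as I can tell, comparing two difftime objects converts both to a common unit first, but converting explicitly with as.numeric(..., units = "weeks") makes the intent visible and also covers the case where one side is a plain number.

```r
# d is made-up data; timeframe matches the definition used above
d <- as.Date(c("2011-04-13", "2011-04-20", "2012-06-01"))
timeframe <- as.difftime(52, units = "weeks")

gaps <- diff(d)   # difftime between consecutive dates
units(gaps)       # "days" for Date input

# Direct comparison of two difftime objects
cmp_direct <- gaps > timeframe

# Explicit alternative: compare plain numbers in a unit you choose
cmp_weeks <- as.numeric(gaps, units = "weeks") >
  as.numeric(timeframe, units = "weeks")

print(cmp_direct)
print(cmp_weeks)
```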

Once you have found the first time buyers, I think the next steps are rather explicit.
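If you would rather skip the merge step, the same per-member rule can be sketched in base R with ave(). This is only an illustration under assumptions: the data frame `toy` and the variable `timeframe_days` are made up, dates are of class Date, and the threshold is expressed in days (52 * 7). It is a different route from the ddply code above, not a replacement tested on the real data.

```r
# Toy data in the same shape as repdata (values are made up)
toy <- data.frame(
  MEMBER_ID = c(1, 1, 1, 2, 2),
  date = as.Date(c("2011-04-13", "2011-04-20", "2012-06-01",
                   "2011-04-22", "2011-04-15"))
)
timeframe_days <- 52 * 7

# Sort by member, then by date, so each member's rows are ascending
toy <- toy[order(toy$MEMBER_ID, toy$date), ]

# Days since the same member's previous purchase; Inf marks their first row
gap <- ave(as.numeric(toy$date), toy$MEMBER_ID,
           FUN = function(d) c(Inf, diff(d)))

# A gap above the threshold (or no earlier purchase) counts as first time
toy$rpt52wk <- ifelse(gap > timeframe_days,
                      "FIRST TIME BUYER", "REPEAT BUYER")
print(toy)
```

Because the flag is computed row by row, duplicated (MEMBER_ID, date) rows are labelled individually rather than through a merge key.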
