I have the following table in an R data frame.
I would like to write the logic that generates the "keep" column. For each person, I would like to flag accounts that have a transaction less than 4 days after the account was first seen. The first line is a new account for this person, so flag it. On the second line the dates are only 2 days apart, so keep it too. The third line is 11 days after we first saw this account, so we do NOT flag it. The same logic applies to the next person: flag only accounts that are less than 4 days old.
I have rebuilt your data frame. Try this solution:
library(lubridate)
library(dplyr)

# Rebuild the sample data from the question
df <- data.frame(
  Person  = c(rep("abc", 3), rep("eee", 5)),
  date    = c("4/1/2016", "4/3/2016", "4/12/2016", "5/3/2016",
              "5/4/2016", "5/4/2016", "5/6/2016", "5/10/2016"),
  account = c("123", "123", "123", "222", "222", "333", "222", "333"),
  stringsAsFactors = FALSE
)

# Parse the month/day/year strings into Date objects
df$date2 <- mdy(df$date)
The best solution, as suggested by @thelatemail:
df %>%
  group_by(Person) %>%
  mutate(keep = as.numeric(date2 - first(date2) <= 4)) %>%
  select(-date2)
Result:
Person date account keep
1 abc 4/1/2016 123 1
2 abc 4/3/2016 123 1
3 abc 4/12/2016 123 0
4 eee 5/3/2016 222 1
5 eee 5/4/2016 222 1
6 eee 5/4/2016 333 1
7 eee 5/6/2016 222 1
8 eee 5/10/2016 333 0
My more convoluted original solution (useful if the account creation date is not in the first line for each person):
df %>%
  group_by(Person) %>%
  slice(which.min(date2)) %>%            # earliest row per person
  select(Person, date2) %>%
  rename(account_create = date2) %>%
  merge(df, ., by = "Person") %>%        # attach that date to every row
  mutate(keep = as.numeric(date2 - account_create <= 4)) %>%
  select(-c(date2, account_create))
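If the rows are not guaranteed to be in date order, a shorter variant might replace first() with min(), which avoids the merge. This is only a sketch under my own assumptions: the `shuffled` data frame and the use of base `as.Date()` instead of `mdy()` are introduced here for illustration and are not part of the original answer.

```r
library(dplyr)

# Same sample data as above
df <- data.frame(
  Person  = c(rep("abc", 3), rep("eee", 5)),
  date    = c("4/1/2016", "4/3/2016", "4/12/2016", "5/3/2016",
              "5/4/2016", "5/4/2016", "5/6/2016", "5/10/2016"),
  account = c("123", "123", "123", "222", "222", "333", "222", "333"),
  stringsAsFactors = FALSE
)
df$date2 <- as.Date(df$date, format = "%m/%d/%Y")

# Hypothetical shuffle so the earliest date is no longer the first row
shuffled <- df[c(3, 1, 2, 8, 6, 4, 7, 5), ]

res <- shuffled %>%
  group_by(Person) %>%
  # min() finds the earliest date regardless of row order
  mutate(keep = as.numeric(date2 - min(date2) <= 4)) %>%
  ungroup() %>%
  arrange(Person, date2, account) %>%
  select(-date2)

res
```

After sorting, the keep column should match the result shown above, even though the input rows were out of order.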
Using data.table:
library(data.table)
setDT(df)[, Keep:=as.numeric(difftime(date,first(date),units="days") < 4), by=Person][]
We group by Person and then create the column Keep using the condition that the date is less than 4 days from the first(date) for the Person.
Here, we assume that the date column is a date-time object. If the date column is read in as character strings, then we can do the conversion using:
df$date <- as.POSIXct(df$date, format="%m/%d/%Y")
With the data given by:
df <- structure(list(Person = c("abc", "abc", "abc", "eee", "eee",
"eee", "eee", "eee"), date = structure(c(1459483200, 1459656000,
1460433600, 1462248000, 1462334400, 1462334400, 1462507200, 1462852800
), class = c("POSIXct", "POSIXt"), tzone = ""), account = c(123L,
123L, 123L, 222L, 222L, 333L, 222L, 333L)), .Names = c("Person",
"date", "account"), row.names = c(NA, -8L), class = "data.frame")
The result is:
## Person date account Keep
##1 abc 2016-04-01 123 1
##2 abc 2016-04-03 123 1
##3 abc 2016-04-12 123 0
##4 eee 2016-05-03 222 1
##5 eee 2016-05-04 222 1
##6 eee 2016-05-04 333 1
##7 eee 2016-05-06 222 1
##8 eee 2016-05-10 333 0
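To see what the conversion and the difftime() comparison in this answer are doing, here is a small base R sketch on the first person's dates only. The `tz = "UTC"` argument is my own assumption, added so that a daylight-saving shift cannot turn a whole number of days into a fraction; the answer itself uses the local time zone.

```r
# Character dates in month/day/year form, as in the question
raw <- c("4/1/2016", "4/3/2016", "4/12/2016")

# tz = "UTC" is an assumption made for this sketch (see note above)
dates <- as.POSIXct(raw, format = "%m/%d/%Y", tz = "UTC")

# Elapsed days since the first entry
elapsed <- as.numeric(difftime(dates, dates[1], units = "days"))
elapsed        # 0 2 11

# The flag is just this comparison turned into 1/0
flag <- as.numeric(elapsed < 4)
flag           # 1 1 0
```

Note that `< 4` and `<= 4` only differ for a row that falls exactly 4 days after the first one; the sample data has no such row, so both conditions give the same flags here.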
Thanks for these great ideas; R is amazing, doing this relatively complicated accounting in four lines of code. Another thing I did not emphasize is that I also need to keep track of whether it is a new account or not. Also, since this data is not necessarily sorted, I sorted it first. Here is the final version:
df %>%
  # sort by date2 as well, so first(date2) is the earliest date per account
  arrange(Person, account, date2) %>%
  group_by(Person, account) %>%
  mutate(keep = as.numeric(date2 - first(date2) < 4)) %>%
  select(-date2)
Result:
Person date account keep
<chr> <chr> <chr> <dbl>
1 abc 4/1/2016 123 1
2 abc 4/3/2016 123 1
3 abc 4/12/2016 123 0
4 eee 5/3/2016 222 1
5 eee 5/4/2016 222 1
6 eee 5/6/2016 222 1
7 eee 5/10/2016 333 1
8 eee 5/11/2016 333 1
So we keep the last line since it is only 1 day from when the 333 account first showed up.
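For anyone who wants to check the per-account logic without dplyr, a base R sketch along the same lines could look like this. It reuses the rows from the result above; the `key`, `d`, and `first_seen` names are mine, and pasting Person and account into one key is just one way to group by both columns.

```r
# Rows taken from the result table above
df <- data.frame(
  Person  = c("abc", "abc", "abc", "eee", "eee", "eee", "eee", "eee"),
  date    = c("4/1/2016", "4/3/2016", "4/12/2016", "5/3/2016",
              "5/4/2016", "5/6/2016", "5/10/2016", "5/11/2016"),
  account = c("123", "123", "123", "222", "222", "222", "333", "333"),
  stringsAsFactors = FALSE
)

# Dates as day counts, so plain subtraction gives elapsed days
d <- as.numeric(as.Date(df$date, format = "%m/%d/%Y"))

# One grouping key per Person/account pair (hypothetical helper column)
key <- paste(df$Person, df$account, sep = "_")

# Earliest day seen within each pair; min() does not depend on row order
first_seen <- ave(d, key, FUN = min)

df$keep <- as.numeric(d - first_seen < 4)
df
```

Because min() is used instead of first(), this version does not need the data to be sorted beforehand.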