Flagging data within groups in R dataframe
I rebuilt your data frame; try this solution:
library(lubridate)
library(dplyr)
df <- data.frame(Person = c(rep("abc", 3), rep("eee", 5)),
                 date = c("4/1/2016", "4/3/2016", "4/12/2016", "5/3/2016",
                          "5/4/2016", "5/4/2016", "5/6/2016", "5/10/2016"),
                 account = c("123", "123", "123", "222", "222", "333", "222", "333"),
                 stringsAsFactors = FALSE)
df$date2 <- mdy(df$date)
The best solution, as suggested by @thelatemail:
df %>%
  group_by(Person) %>%
  mutate(keep = as.numeric(date2 - first(date2) <= 4)) %>%
  select(-date2)
Result:
  Person      date account keep
1    abc  4/1/2016     123    1
2    abc  4/3/2016     123    1
3    abc 4/12/2016     123    0
4    eee  5/3/2016     222    1
5    eee  5/4/2016     222    1
6    eee  5/4/2016     333    1
7    eee  5/6/2016     222    1
8    eee 5/10/2016     333    0
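For readers who prefer to avoid extra packages, the same per-person flag can be sketched in base R. This is only an illustration, not part of the original answer: it assumes the rows are already in date order within each person, and the helper names `days` and `days_since_first` are made up for the example.

```r
# Base R sketch; no packages needed.
df <- data.frame(
  Person  = c(rep("abc", 3), rep("eee", 5)),
  date    = c("4/1/2016", "4/3/2016", "4/12/2016", "5/3/2016",
              "5/4/2016", "5/4/2016", "5/6/2016", "5/10/2016"),
  account = c("123", "123", "123", "222", "222", "333", "222", "333"),
  stringsAsFactors = FALSE
)

# Parse the month/day/year strings and keep them as plain day counts.
days <- as.numeric(as.Date(df$date, format = "%m/%d/%Y"))

# ave() applies the function within each Person and returns a vector
# aligned with the original rows; x[1] is the first row of each group
# (assumed here to be the earliest date).
days_since_first <- ave(days, df$Person, FUN = function(x) x - x[1])

# 1 if the row is within 4 days of the person's first row, else 0.
df$keep <- as.numeric(days_since_first <= 4)
df
```

The `keep` column should match the dplyr result above, since both compare each row to the first row of its group.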
My more complicated original solution (useful if the account creation date is not in the first row for each person):
df %>%
  group_by(Person) %>%
  slice(which.min(date2)) %>%
  select(Person, date2) %>%
  rename(account_create = date2) %>%
  merge(df, ., by = "Person") %>%
  mutate(keep = as.numeric(date2 - account_create <= 4)) %>%
  select(-c(date2, account_create))
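If the only concern is that the earliest date may not sit in the first row, a shorter variant is possible: compare against `min(date2)` inside the grouped `mutate()` instead of building and merging a lookup table. The sketch below is an alternative I am suggesting, not the original author's code, and `df_unsorted` is hypothetical data made up to show the unsorted case.

```r
library(dplyr)

# Hypothetical unsorted input: each person's earliest date is NOT in the first row.
df_unsorted <- data.frame(
  Person = c("abc", "abc", "abc", "eee", "eee"),
  date2  = as.Date(c("2016-04-12", "2016-04-01", "2016-04-03",
                     "2016-05-10", "2016-05-03")),
  stringsAsFactors = FALSE
)

flagged <- df_unsorted %>%
  group_by(Person) %>%
  # min(date2) is the earliest date per person, regardless of row order.
  # Subtracting two Date vectors gives a difference in days.
  mutate(keep = as.numeric(as.numeric(date2 - min(date2)) <= 4)) %>%
  ungroup()

flagged
```

Because `min()` does not depend on row order, no prior `arrange()` is needed with this variant.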
Using data.table:
library(data.table)
setDT(df)[, Keep:=as.numeric(difftime(date,first(date),units="days") < 4), by=Person][]
We group by Person, then create the column Keep using the condition that date is less than 4 days after the first(date) for that Person.
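Note that this answer tests `< 4` while the dplyr answer tests `<= 4`. On the sample data both give the same flags, but they differ for a row exactly four days after the first date. A small sketch of the boundary case, using made-up dates rather than the question's data:

```r
# Hypothetical dates 3, 4 and 5 days after a first date of 2016-05-03.
first_date <- as.Date("2016-05-03")
later      <- as.Date(c("2016-05-06", "2016-05-07", "2016-05-08"))

# Gap in days as a plain number.
gap <- as.numeric(difftime(later, first_date, units = "days"))

strict    <- as.numeric(gap < 4)   # excludes the row exactly 4 days later
inclusive <- as.numeric(gap <= 4)  # includes the row exactly 4 days later

data.frame(later, gap, strict, inclusive)
```

Which comparison is right depends on whether "within 4 days" is meant to include day 4 itself.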
Here we assume that the date column is a date-time object. If the date column was read in as a string, we can convert it with:
df$date <- as.POSIXct(df$date, format="%m/%d/%Y")
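One caveat with this conversion: without a `tz` argument, `as.POSIXct()` uses the session's local time zone, so a gap that spans a daylight-saving change can come out slightly under or over a whole number of days. If that matters, one option (an assumption on my part, not something the original answer does) is to parse in UTC, as in this sketch with a made-up `date_chr` vector:

```r
# Hypothetical character dates in month/day/year form.
date_chr <- c("4/1/2016", "4/3/2016", "4/12/2016")

# Parsing in UTC avoids daylight-saving shifts in the local time zone.
date_utc <- as.POSIXct(date_chr, format = "%m/%d/%Y", tz = "UTC")

# Whole-day gaps relative to the first element.
gap_days <- as.numeric(difftime(date_utc, date_utc[1], units = "days"))
gap_days
```

Parsing with `as.Date(date_chr, format = "%m/%d/%Y")` would be another way to sidestep time-of-day issues when only calendar dates are needed.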
With the data given as:
df <- structure(list(Person = c("abc", "abc", "abc", "eee", "eee",
"eee", "eee", "eee"), date = structure(c(1459483200, 1459656000,
1460433600, 1462248000, 1462334400, 1462334400, 1462507200, 1462852800
), class = c("POSIXct", "POSIXt"), tzone = ""), account = c(123L,
123L, 123L, 222L, 222L, 333L, 222L, 333L)), .Names = c("Person",
"date", "account"), row.names = c(NA, -8L), class = "data.frame")
The result is:
##   Person       date account Keep
##1     abc 2016-04-01     123    1
##2     abc 2016-04-03     123    1
##3     abc 2016-04-12     123    0
##4     eee 2016-05-03     222    1
##5     eee 2016-05-04     222    1
##6     eee 2016-05-04     333    1
##7     eee 2016-05-06     222    1
##8     eee 2016-05-10     333    0
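Thanks for these great ideas; it is amazing that R can do relatively complicated accounting in four lines of code. One more thing I did not emphasize is that I also need to track whether an account is new. In addition, since this data is not necessarily sorted, I sort it first, so here is the final version.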
df %>%
  # sort by date as well, so first(date2) is the earliest date per account
  arrange(Person, account, date2) %>%
  group_by(Person, account) %>%
  mutate(keep = as.numeric(date2 - first(date2) < 4)) %>%
  select(-date2)
Result:
  Person      date account  keep
   <chr>     <chr>   <chr> <dbl>
1    abc  4/1/2016     123     1
2    abc  4/3/2016     123     1
3    abc 4/12/2016     123     0
4    eee  5/3/2016     222     1
5    eee  5/4/2016     222     1
6    eee  5/6/2016     222     1
7    eee 5/10/2016     333     1
8    eee 5/11/2016     333     1
So we keep the last row, because it is only one day after account 333 first appears.