Flagging data within groups in an R data frame

Question

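I have the following table in an R data frame

I want to write the logic that generates the "keep" column. For each person, I want to flag the rows whose transaction date falls within 4 days of that person's first visit. The first row is a new account for this person, so flag it. The second row is only 2 days later, so keep it. The third row is 11 days after we first saw this account, so we don't flag it. The same logic applies to the next person: only flag rows that are less than 4 days after the first visit.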

Answer 1

I rebuilt your data frame; try this solution:

library(lubridate)
library(dplyr)

df <- data.frame(Person = c(rep("abc",3), rep("eee", 5)),
           date = c("4/1/2016", "4/3/2016", "4/12/2016", "5/3/2016", "5/4/2016","5/4/2016","5/6/2016", "5/10/2016"),
           account = c("123","123","123","222","222","333","222","333"), stringsAsFactors = FALSE)

df$date2 <- mdy(df$date)

The best solution, suggested by @thelatemail:

df %>% 
group_by(Person) %>% 
mutate(keep=as.numeric(date2 - first(date2) <= 4)) %>% 
select(-date2)

Result:

 Person      date account keep
1    abc  4/1/2016     123    1
2    abc  4/3/2016     123    1
3    abc 4/12/2016     123    0
4    eee  5/3/2016     222    1
5    eee  5/4/2016     222    1
6    eee  5/4/2016     333    1
7    eee  5/6/2016     222    1
8    eee 5/10/2016     333    0

My more complicated original solution (useful if the account creation date is not in the first row for each person):

df %>% 
group_by(Person) %>% 
slice(which.min(date2)) %>%
select(Person, date2) %>%
rename(account_create = date2) %>%
merge(df, ., by = "Person") %>%
mutate(keep = as.numeric(date2 - account_create <= 4)) %>%
select(-c(date2, account_create))
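If row order is the only concern, a shorter variant may be to compare against min(date2) instead of first(date2) inside the grouped mutate. This is only a sketch: the shuffled data frame df_shuffled below is made up for illustration and is not from the question.

```r
library(dplyr)

# Hypothetical data: the earliest date is deliberately NOT the first row
# of each Person group.
df_shuffled <- data.frame(
  Person = c("abc", "abc", "abc", "eee", "eee"),
  date2  = as.Date(c("2016-04-12", "2016-04-01", "2016-04-03",
                     "2016-05-10", "2016-05-03")),
  stringsAsFactors = FALSE
)

res <- df_shuffled %>%
  group_by(Person) %>%
  # min() finds the earliest date in the group regardless of row order
  mutate(keep = as.numeric(as.numeric(date2 - min(date2), units = "days") <= 4)) %>%
  ungroup()

res$keep
```

Because min() ignores row order, no prior sorting or self-merge is needed for this particular check.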

Answer 2

Using data.table:

library(data.table)
setDT(df)[, Keep:=as.numeric(difftime(date,first(date),units="days") < 4), by=Person][]

We group by Person, then create the column Keep from the condition that date is less than 4 days after the first(date) for that Person.

Here we assume that the date column is a date-time object. If the date column was read in as a string, we can convert it with:

df$date <- as.POSIXct(df$date, format="%m/%d/%Y")
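To see what the difftime() condition computes, here is a small base R sketch on three dates taken from the first person's rows. The variable names dates, elapsed and keep are introduced here for illustration, and tz = "UTC" is an assumption made so that daylight-saving shifts cannot affect the day counts.

```r
# Parse month/day/year strings into POSIXct; UTC is assumed for this sketch
dates <- as.POSIXct(c("4/1/2016", "4/3/2016", "4/12/2016"),
                    format = "%m/%d/%Y", tz = "UTC")

# Elapsed time since the first date, forced to days rather than
# whatever unit difftime() would pick automatically
elapsed <- difftime(dates, dates[1], units = "days")

# 1 when strictly fewer than 4 days have passed, otherwise 0
keep <- as.numeric(as.numeric(elapsed) < 4)
keep
```

Setting units = "days" explicitly matters with POSIXct values, since otherwise the unit is chosen from the size of the difference and a comparison against 4 could silently mean hours or minutes.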

With the data given as:

df <- structure(list(Person = c("abc", "abc", "abc", "eee", "eee", 
"eee", "eee", "eee"), date = structure(c(1459483200, 1459656000, 
1460433600, 1462248000, 1462334400, 1462334400, 1462507200, 1462852800
), class = c("POSIXct", "POSIXt"), tzone = ""), account = c(123L, 
123L, 123L, 222L, 222L, 333L, 222L, 333L)), .Names = c("Person", 
"date", "account"), row.names = c(NA, -8L), class = "data.frame")

The result is:

##  Person       date account  Keep
##1    abc 2016-04-01     123     1
##2    abc 2016-04-03     123     1
##3    abc 2016-04-12     123     0
##4    eee 2016-05-03     222     1
##5    eee 2016-05-04     222     1
##6    eee 2016-05-04     333     1
##7    eee 2016-05-06     222     1
##8    eee 2016-05-10     333     0

Answer 3

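Thanks for these great ideas; R is amazing, doing relatively complicated accounting in four lines of code. One more thing I didn't emphasize is that I also need to track whether it is a new account. Also, since the data is not necessarily sorted, I sort it first, so here is the final version.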

    df %>% 
      arrange(Person, account, date2) %>%   # sort by date too, so first(date2) is the earliest
      group_by(Person, account) %>% 
      mutate(keep = as.numeric(date2 - first(date2) < 4)) %>% 
      select(-date2)

Result:

    Person      date account  keep
    <chr>     <chr>   <chr> <dbl>
1    abc  4/1/2016     123     1
2    abc  4/3/2016     123     1
3    abc 4/12/2016     123     0
4    eee  5/3/2016     222     1
5    eee  5/4/2016     222     1
6    eee  5/6/2016     222     1
7    eee 5/10/2016     333     1
8    eee 5/11/2016     333     1

So we keep the last row, because it is only one day after account 333 first appeared.
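For comparison, the same per-account rule can be sketched in base R without relying on row order, using ave() with min. The data below is re-typed from the result table above; the names df2, grp and first_seen are introduced here for illustration only.

```r
# Data re-typed from the result table above
df2 <- data.frame(
  Person  = c("abc", "abc", "abc", "eee", "eee", "eee", "eee", "eee"),
  date    = c("4/1/2016", "4/3/2016", "4/12/2016", "5/3/2016",
              "5/4/2016", "5/6/2016", "5/10/2016", "5/11/2016"),
  account = c("123", "123", "123", "222", "222", "222", "333", "333"),
  stringsAsFactors = FALSE
)
df2$date2 <- as.Date(df2$date, format = "%m/%d/%Y")

# One grouping key per Person/account pair
grp <- paste(df2$Person, df2$account)

# Earliest date (as a day number) within each Person/account group
first_seen <- ave(as.numeric(df2$date2), grp, FUN = min)

# 1 when the row is fewer than 4 days after the account first appeared
df2$keep <- as.numeric(as.numeric(df2$date2) - first_seen < 4)
df2$keep
```

Since min() is taken within each group, the rows do not have to be sorted beforehand.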

3 answers

Answer 1
1 2016-09-01 22:35:00

Answer 2
1 2016-09-01 22:41:06

Answer 3
1 2016-09-02 13:41:05
