
Flagging data within groups in R dataframe

I have the following table in an R data frame:

[Image: table with the columns Person, date, account and keep]

I would like to write the logic that generates the "keep" column. For each person, I would like to flag accounts that have a transaction within 4 days of first access. The first line is a new account for this person, so flag it. In the second line the dates are only 2 days apart, so keep it too. The third line is 11 days after we first saw this account, so we do NOT flag it. The same logic goes for the next person: flag only accounts that are less than 4 days old.

I have rebuilt your data frame; try this solution:

library(lubridate)
library(dplyr)

df <- data.frame(Person = c(rep("abc", 3), rep("eee", 5)),
           date = c("4/1/2016", "4/3/2016", "4/12/2016", "5/3/2016", "5/4/2016", "5/4/2016", "5/6/2016", "5/10/2016"),
           account = c("123", "123", "123", "222", "222", "333", "222", "333"), stringsAsFactors = FALSE)

# Parse the month/day/year strings into Date objects
df$date2 <- mdy(df$date)

The best solution, as suggested by @thelatemail:

df %>%
  group_by(Person) %>%
  mutate(keep = as.numeric(date2 - first(date2) <= 4)) %>%
  select(-date2)

Result:

 Person      date account keep
1    abc  4/1/2016     123    1
2    abc  4/3/2016     123    1
3    abc 4/12/2016     123    0
4    eee  5/3/2016     222    1
5    eee  5/4/2016     222    1
6    eee  5/4/2016     333    1
7    eee  5/6/2016     222    1
8    eee 5/10/2016     333    0

My more convoluted original solution (useful if the account creation date is not in the first line for each person):

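df %>%
  group_by(Person) %>%
  # Keep only the row with the earliest date for each person
  slice(which.min(date2)) %>%
  select(Person, date2) %>%
  rename(account_create = date2) %>%
  # Join the earliest date back onto every row of the original data
  merge(df, ., by = "Person") %>%
  mutate(keep = as.numeric(date2 - account_create <= 4)) %>%
  select(-c(date2, account_create))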
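
If the rows may not be sorted by date, a shorter alternative to the merge is to compare each date with the group's minimum instead of its first row. The following is a minimal sketch under that assumption. The reversed copy `df_shuffled` and the use of base `as.Date()` in place of `mdy()` are illustrative choices, not part of the original answer.

```r
library(dplyr)

df <- data.frame(
  Person  = c(rep("abc", 3), rep("eee", 5)),
  date    = c("4/1/2016", "4/3/2016", "4/12/2016", "5/3/2016",
              "5/4/2016", "5/4/2016", "5/6/2016", "5/10/2016"),
  account = c("123", "123", "123", "222", "222", "333", "222", "333"),
  stringsAsFactors = FALSE
)

# Reverse the rows so the earliest date is no longer first in each group
# (hypothetical input, used only to illustrate the point).
df_shuffled <- df[rev(seq_len(nrow(df))), ]

result <- df_shuffled %>%
  mutate(date2 = as.Date(date, format = "%m/%d/%Y")) %>%
  group_by(Person) %>%
  # min() finds the earliest date regardless of row order
  mutate(keep = as.numeric(date2 - min(date2) <= 4)) %>%
  ungroup() %>%
  select(-date2)
```

With `first(date2)` in place of `min(date2)`, the reversed input would measure every gap from the latest date in each group, so the choice between the two depends on whether the input order can be trusted.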

Using data.table:

library(data.table)
setDT(df)[, Keep := as.numeric(difftime(date, first(date), units = "days") < 4), by = Person][]

We group by Person and then create the column Keep, which is 1 when the date is less than 4 days after first(date) for that Person and 0 otherwise.

Here, we assume that the date column is a date-time object. If the date column is read in as character strings, we can convert it using:

df$date <- as.POSIXct(df$date, format="%m/%d/%Y")

With the data given by:

df <- structure(list(Person = c("abc", "abc", "abc", "eee", "eee", 
"eee", "eee", "eee"), date = structure(c(1459483200, 1459656000, 
1460433600, 1462248000, 1462334400, 1462334400, 1462507200, 1462852800
), class = c("POSIXct", "POSIXt"), tzone = ""), account = c(123L, 
123L, 123L, 222L, 222L, 333L, 222L, 333L)), .Names = c("Person", 
"date", "account"), row.names = c(NA, -8L), class = "data.frame")

The result is:

##  Person       date account  Keep
##1    abc 2016-04-01     123     1
##2    abc 2016-04-03     123     1
##3    abc 2016-04-12     123     0
##4    eee 2016-05-03     222     1
##5    eee 2016-05-04     222     1
##6    eee 2016-05-04     333     1
##7    eee 2016-05-06     222     1
##8    eee 2016-05-10     333     0
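
The explicit units = "days" in the data.table call matters for POSIXct columns, because plain subtraction of date-times lets R pick the units automatically. The sketch below uses only base R and a three-row vector `x` made up from the first person's dates. It is meant as an illustration of the pitfall, not as part of the answer above.

```r
# Three date-times, built in UTC so daylight saving does not affect the gaps.
x <- as.POSIXct(c("4/1/2016", "4/3/2016", "4/12/2016"),
                format = "%m/%d/%Y", tz = "UTC")

# Plain subtraction lets R choose the units; because the smallest gap here
# is zero, it picks seconds, and a comparison with 4 would mean 4 seconds.
auto_diff <- x - x[1]
units(auto_diff)

# Requesting days explicitly makes the comparison with 4 meaningful.
day_diff <- difftime(x, x[1], units = "days")
keep <- as.numeric(day_diff < 4)
```

Date objects (as produced by `mdy()` or `as.Date()`) do not have this problem, since subtracting two Dates always yields a difference in days.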

Thanks for these great ideas; R is amazing, doing this relatively complicated accounting in four lines of code. Another thing I did not emphasize is that I also need to keep track of whether it is a new account or not. Also, since this data is not necessarily sorted, I sort it first. Here is the final version:

    df %>%
      # Sort by date as well, so first(date2) is the earliest date in each group
      arrange(Person, account, date2) %>%
      group_by(Person, account) %>%
      mutate(keep = as.numeric(date2 - first(date2) < 4)) %>%
      select(-date2)

Result:

    Person      date account  keep
    <chr>     <chr>   <chr> <dbl>
1    abc  4/1/2016     123     1
2    abc  4/3/2016     123     1
3    abc 4/12/2016     123     0
4    eee  5/3/2016     222     1
5    eee  5/4/2016     222     1
6    eee  5/6/2016     222     1
7    eee 5/10/2016     333     1
8    eee 5/11/2016     333     1

So we keep the last line, since it is only 1 day after the 333 account first showed up.
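
For readers who prefer not to load dplyr, the same per-account flag can be sketched in base R with `ave()`. The data frame below is rebuilt from the result table above. The helper names `day_num`, `key` and `first_seen` are made up for this illustration, and the approach is an alternative sketch, not the method used in the thread.

```r
df <- data.frame(
  Person  = c("abc", "abc", "abc", "eee", "eee", "eee", "eee", "eee"),
  date    = c("4/1/2016", "4/3/2016", "4/12/2016", "5/3/2016",
              "5/4/2016", "5/6/2016", "5/10/2016", "5/11/2016"),
  account = c("123", "123", "123", "222", "222", "222", "333", "333"),
  stringsAsFactors = FALSE
)

# Dates as plain day counts, so ave() returns plain numbers too.
day_num <- as.numeric(as.Date(df$date, format = "%m/%d/%Y"))

# One grouping key per Person/account pair.
key <- paste(df$Person, df$account, sep = "|")

# Earliest day for each pair, repeated on every row of that pair.
# min() does not depend on row order, so no prior sorting is needed.
first_seen <- ave(day_num, key, FUN = min)

df$keep <- as.numeric(day_num - first_seen < 4)
```

Because the comparison uses the group minimum, this version gives the same flags whether or not the rows arrive sorted.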
