R delete rows based on values in previous rows
I am new to R and trying to delete rows based on values of previous rows. Sample data:
Cust_ID | Date | Value
500219 | 2016-04-11 12:00:00 | 0
500219 | 2016-04-12 16:00:00 | 0
500219 | 2016-04-14 11:00:00 | 1
500219 | 2016-04-15 12:00:00 | 1
500219 | 2016-05-23 09:00:00 | 0
500219 | 2016-05-02 19:00:00 | 0
500220 | 2016-04-11 12:00:00 | 0
500220 | 2016-04-14 11:00:00 | 1
500220 | 2016-04-15 12:00:00 | 1
500220 | 2016-05-23 09:00:00 | 0
500220 | 2016-05-02 19:00:00 | 0
For each Cust_ID, I would like to keep only the rows up to and including the rows where Value = 1, and drop the rows that follow, giving the result:
Cust_ID | Date | Value
500219 | 2016-04-11 12:00:00 | 0
500219 | 2016-04-12 16:00:00 | 0
500219 | 2016-04-14 11:00:00 | 1
500219 | 2016-04-15 12:00:00 | 1
500220 | 2016-04-11 12:00:00 | 0
500220 | 2016-04-14 11:00:00 | 1
500220 | 2016-04-15 12:00:00 | 1
Any help would be appreciated!
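For anyone who wants to run the answers, the sample table can be rebuilt roughly as follows. This is only a sketch: storing Date as character is an assumption (it matches the `<chr>` column type shown in the dplyr output further down), and the copy named `df1` is there only because some answers use that name for the same table.

```r
# Rebuild the sample data from the question (Date kept as character: an assumption)
df <- data.frame(
  Cust_ID = rep(c(500219L, 500220L), times = c(6L, 5L)),
  Date = c(
    "2016-04-11 12:00:00", "2016-04-12 16:00:00", "2016-04-14 11:00:00",
    "2016-04-15 12:00:00", "2016-05-23 09:00:00", "2016-05-02 19:00:00",
    "2016-04-11 12:00:00", "2016-04-14 11:00:00", "2016-04-15 12:00:00",
    "2016-05-23 09:00:00", "2016-05-02 19:00:00"
  ),
  Value = c(0L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 0L),
  stringsAsFactors = FALSE
)

# Some answers refer to the same table as df1
df1 <- df
```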
Here is a split-apply-combine method that, for each customer, keeps the rows up to and including the first 1 as well as any later rows whose value is 1.
# split data by customer ID
myList <- split(df, df$Cust_ID)
# loop through ID list, drop desired rows, rbind resulting list
dfNew <- do.call(rbind, lapply(myList, function(i) {
  drop <- which(i$Value == 1)
  i[c(1:drop[1], drop[-1]), ]
}))
which returns
dfNew
Cust_ID Date Value
500219.1 500219 2016-04-11 12:00:00 0
500219.2 500219 2016-04-12 16:00:00 0
500219.3 500219 2016-04-14 11:00:00 1
500219.4 500219 2016-04-15 12:00:00 1
500220.7 500220 2016-04-11 12:00:00 0
500220.8 500220 2016-04-14 11:00:00 1
500220.9 500220 2016-04-15 12:00:00 1
Note that this solution will not work if there are customer IDs that never have a value equal to 1: for such a customer, which() returns an empty vector, drop[1] is NA, and 1:drop[1] throws an error.
If you want to retain customers whose observations never reach the 1 threshold, then use
dfNew <- do.call(rbind, lapply(myList, function(i) {
  drop <- which(i$Value == 1)
  if (length(drop) != 0) i[c(1:drop[1], drop[-1]), ]
  else i
}))
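To make the guarded version concrete, here is a minimal sketch on made-up toy data (the two customer IDs and their values are hypothetical, not from the question). Customer 2 never reaches 1, so its rows pass through unchanged, and the `tryCatch` line shows why the unguarded indexing fails in that case.

```r
# Hypothetical toy data: customer 1 has a 1, customer 2 never does
df <- data.frame(
  Cust_ID = c(1, 1, 1, 2, 2),
  Value   = c(0, 1, 0, 0, 0)
)

myList <- split(df, df$Cust_ID)

dfNew <- do.call(rbind, lapply(myList, function(i) {
  drop <- which(i$Value == 1)
  # Keep rows up to the first 1 plus later 1s; keep everything if there is no 1
  if (length(drop) != 0) i[c(1:drop[1], drop[-1]), ] else i
}))

# Without the guard, an empty which() result makes drop[1] NA and 1:NA errors
unguarded <- tryCatch(
  {
    d <- which(c(0, 0) == 1)
    1:d[1]
  },
  error = function(e) conditionMessage(e)
)
```

Here `dfNew` keeps the first two rows of customer 1 and both rows of customer 2, while `unguarded` holds the error message rather than an index vector.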
We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1)). Grouped by 'Cust_ID', we take the sequence from 1 up to the max of the indexes where 'Value' is 1, use it to get the row index (.I), and use that to subset the data.table rows. Groups that never contain a 1 are kept in full.
library(data.table)
setDT(df1)[df1[, if(any(Value == 1)) .I[seq(max(which(Value == 1)))]
else .I[1:.N] , by = Cust_ID]$V1]
# Cust_ID Date Value
#1: 500219 2016-04-11 12:00:00 0
#2: 500219 2016-04-12 16:00:00 0
#3: 500219 2016-04-14 11:00:00 1
#4: 500219 2016-04-15 12:00:00 1
#5: 500220 2016-04-11 12:00:00 0
#6: 500220 2016-04-14 11:00:00 1
#7: 500220 2016-04-15 12:00:00 1
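To see what the inner call produces before it is used for subsetting, here is a small sketch on hypothetical toy data (the table `dt` and its values are made up for illustration). It assumes the data.table package is installed, and it uses `seq_len()` and a bare `.I` as equivalent spellings of the expressions above.

```r
library(data.table)

# Hypothetical toy data: customer 1 has a 1 in its second row, customer 2 has none
dt <- data.table(
  Cust_ID = c(1L, 1L, 1L, 2L, 2L),
  Value   = c(0L, 1L, 0L, 0L, 0L)
)

# Per group, collect the global row numbers (.I) to keep; the result column is named V1
idx <- dt[, if (any(Value == 1)) .I[seq_len(max(which(Value == 1)))] else .I,
          by = Cust_ID]

# Subset the original table by those row numbers
kept <- dt[idx$V1]
```

`idx$V1` is a plain integer vector of row positions in `dt`, which is why it can be passed straight back into `dt[...]`.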
Or using a similar approach with dplyr:
library(dplyr)
df1 %>%
group_by(Cust_ID) %>%
slice(if(any(Value==1)) seq(max(which(Value==1))) else row_number())
# Cust_ID Date Value
# <int> <chr> <int>
#1 500219 2016-04-11 12:00:00 0
#2 500219 2016-04-12 16:00:00 0
#3 500219 2016-04-14 11:00:00 1
#4 500219 2016-04-15 12:00:00 1
#5 500220 2016-04-11 12:00:00 0
#6 500220 2016-04-14 11:00:00 1
#7 500220 2016-04-15 12:00:00 1
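One thing to be aware of: the split-apply-combine answer and the max(which(...)) answers agree on the sample data, but they are not the same rule. A quick base R sketch on a made-up vector (hypothetical values, not from the question) shows where they diverge, namely when a 0 sits between two 1s.

```r
# Hypothetical values for a single customer, with a 0 between two 1s
v <- c(0, 1, 0, 1, 0)
ones <- which(v == 1)

# Rule from the split-apply-combine answer: rows up to the first 1, plus later 1s
first_rule <- c(1:ones[1], ones[-1])

# Rule from the data.table / dplyr answers: every row up to the last 1
last_rule <- seq(max(ones))
```

With these values `first_rule` is 1, 2, 4 and `last_rule` is 1, 2, 3, 4, so the row in position 3 is kept only by the second rule. Which one is right depends on what should happen to such rows, which the question's sample data does not settle.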
Looping approach:
cust <- 0
keep <- FALSE
keepers <- vector(mode = "logical", length = nrow(df))
## walk through the dataframe backwards
for (rec in nrow(df):1)
{
  ## have we been working with this customer?
  if (df[rec, ]$Cust_ID == cust)
  {
    if (df[rec, ]$Value == 1 || keep)
    {
      keepers[rec] <- TRUE
      keep <- TRUE
    }
  }
  else
  {
    ## new customer: remember the ID and reset the keep flag
    cust <- df[rec, ]$Cust_ID
    if (df[rec, ]$Value == 1)
    {
      keepers[rec] <- TRUE
      keep <- TRUE
    }
    else
    {
      keep <- FALSE
    }
  }
}
df <- df[keepers, ]
df
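The same backward scan can also be written without an explicit loop. The following is a sketch of one possible vectorised equivalent in base R, not part of the original answer; the toy data and the name `has_one_at_or_after` are made up for illustration. Like the loop, it keeps every row up to a customer's last 1 and drops customers that never have a 1.

```r
# Hypothetical toy data: customer 3 never has a 1
df <- data.frame(
  Cust_ID = c(1, 1, 1, 2, 2, 3),
  Value   = c(0, 1, 0, 1, 1, 0)
)

# Within each customer, flag rows that have a 1 at or after them:
# reverse, take the running maximum, reverse back
has_one_at_or_after <- ave(
  as.integer(df$Value == 1), df$Cust_ID,
  FUN = function(v) rev(cummax(rev(v)))
)

result <- df[has_one_at_or_after == 1, ]
```

Unlike the loop, `ave()` groups by ID rather than by position, so it does not depend on each customer's rows being stored contiguously.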