[英]Conditionally remove rows based on date and time
I am trying to implement a way to filter this dataframe df
我正在尝试实现一种过滤此数据帧df
structure(list(Name = c("Jim", "Jane", "Jose", "Matt", "Mickey",
"Tom", "Peter", "Jane", "Jim", "Jose"), Progress = c("65", "20",
"80", "20", "65", "45", "20", "70", "25", "80"), EndDate = c("11/25/2018 16:45",
"11/25/2018 18:05", "11/25/2018 14:20", "12/1/2018 22:52", "11/29/2018 18:15",
"12/2/2018 15:27", "11/26/2018 12:07", "11/30/2018 11:18", "11/29/2018 18:04",
"11/29/2018 21:12")), row.names = c(NA, -10L), class = "data.frame")
I want to filter it such that if there are duplicate responses in the Name
column like how Jim appears twice I would like to keep the row that has the earliest date and time according to the EndDate
column ONLY if the Progress
column value is greater than 70. Otherwise I want to take the row that has a later date and time in the EndDate
column. 我想过滤它,以便如果在Name
列中有重复的响应,就像Jim出现两次一样,如果Progress
列值大于70,我想根据EndDate
列保留具有最早日期和时间的行。否则我想在EndDate
列中获取具有较晚日期和时间的行。
Based on the condition, we convert the 'EndDate' to DateTime
class, then arrange
by 'Name', 'EndDate', grouped by 'Name' if
the first
element of 'Progres' is greater than 70 return index 1 or else
the last row index in slice
to subset the rows 根据病情的不同,我们转换“结束日期”,以DateTime
类,则arrange
由“姓名”,“结束日期”,通过“名称”组合if
在first
的“进步党”元素是大于70回报指数1或else
最后slice
行索引以对行进行子集化
library(tidyverse)
library(lubridate)
df %>%
mutate(EndDate = mdy_hm(EndDate)) %>%
# if there are multiple formats
# mutate(EndDate = anytime::anytime(EndDate)) %>%
arrange(Name, EndDate) %>%
group_by(Name) %>%
slice(if(first(Progress) > 70) 1 else n())
# A tibble: 7 x 3
# Groups: Name [7]
# Name Progress EndDate
# <chr> <chr> <dttm>
#1 Jane 70 2018-11-30 11:18:00
#2 Jim 25 2018-11-29 18:04:00
#3 Jose 80 2018-11-25 14:20:00
#4 Matt 20 2018-12-01 22:52:00
#5 Mickey 65 2018-11-29 18:15:00
#6 Peter 20 2018-11-26 12:07:00
#7 Tom 45 2018-12-02 15:27:00
NOTE: if there are multiple 'DateTime' formats, one option is anytime::anytime
instead of mdy_hm
注意:如果有多个'DateTime'格式,则一个选项是anytime::anytime
而不是mdy_hm
Using dplyr
, we first convert EndDate
to date time object using parse_date_time
from lubridate
then we group_by
Name
and select row with minimum EndDate
if Progress > 70
and number of rows for each Name
is more than 1 and maximum EndDate
otherwise. 使用dplyr
,我们先转换EndDate
使用最新的时间对象parse_date_time
从lubridate
那么我们group_by
Name
,并选择行以最小的EndDate
,如果Progress > 70
和行每个数Name
大于1和最大的EndDate
,否则。 If there is only one row for the Name
then we select only that one by default. 如果Name
只有一行,那么我们默认只选择那一行。
library(dplyr)
library(lubridate)
df %>%
mutate(EndDate = parse_date_time(EndDate,c("%m-%d-%y %H:%M","%Y-%m-%d %H:%M:%S"))) %>%
group_by(Name) %>%
slice(ifelse(n() > 1,
ifelse(any(Progress > 70), which.min(EndDate), which.max(EndDate)), 1))
# Name Progress EndDate
# <chr> <chr> <dttm>
#1 Jane 70 2018-11-30 11:18:00
#2 Jim 25 2018-11-29 18:04:00
#3 Jose 80 2018-11-25 14:20:00
#4 Matt 20 2018-12-01 22:52:00
#5 Mickey 65 2018-11-29 18:15:00
#6 Peter 20 2018-11-26 12:07:00
#7 Tom 45 2018-12-02 15:27:00
An (of course) this can also be done using data.table
(当然)这也可以使用data.table
完成
sample data 样本数据
df <- structure(list(Name = c("Jim", "Jane", "Jose", "Matt", "Mickey",
"Tom", "Peter", "Jane", "Jim", "Jose"), Progress = c("65", "20",
"80", "20", "65", "45", "20", "70", "25", "80"), EndDate = c("11/25/2018 16:45",
"11/25/2018 18:05", "11/25/2018 14:20", "12/1/2018 22:52", "11/29/2018 18:15",
"12/2/2018 15:27", "11/26/2018 12:07", "11/30/2018 11:18", "11/29/2018 18:04",
"11/29/2018 21:12")), row.names = c(NA, -10L), class = "data.frame")
code 码
#create the data.table (can also be done using setDT(df) )
dt <- as.data.table( df )
#set the dates to a proper POSIXct-format
dt[, EndDate := as.POSIXct( EndDate, format = "%m/%d/%Y %H:%M") ]
#order omn EndDate (by reference!)
setorder( dt, EndDate )
#summarise by Name, if first Progress >70 then keep it, else keep last Progress
dt[ , list( Progress = ifelse( Progress[1] > 70, Progress[1], Progress[.N] ) ), by = .(Name)][]
benchmarks 基准
microbenchmark::microbenchmark(
data.table = {
dt[, EndDate := as.POSIXct( EndDate, format = "%m/%d/%Y %H:%M") ]
setorder( dt, EndDate )
dt[ , list( Progress = ifelse( Progress[1] > 70, Progress[1], Progress[.N] ) ), by = .(Name)][]
},
tidyverse1 = {
df %>%
mutate(EndDate = mdy_hm(EndDate)) %>%
arrange(Name, EndDate) %>%
group_by(Name) %>%
slice(if(first(Progress) > 70) 1 else n())
},
tidyverse2 = {
df %>%
mutate(EndDate = mdy_hm(EndDate)) %>%
group_by(Name) %>%
slice(ifelse(n() > 1,
ifelse(any(Progress > 70), which.min(EndDate), which.max(EndDate)), 1))
}
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# data.table 1.654241 2.030820 2.709023 2.556978 2.782023 30.36590 100
# tidyverse1 6.847731 7.218286 8.742247 7.516838 8.034861 72.00902 100
# tidyverse2 6.173201 6.506398 7.286639 6.764582 7.088591 52.10180 100
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.