Get the max date based on multiple columns of an R dplyr / tidyverse dataframe
I have a CSV file like the one below:
Date | Timestamp | Units | Name | Condition | Obj | Param | Attrib1 | Atrrib2 | Result |
---|---|---|---|---|---|---|---|---|---|
2019-07-31 | 2019-08-01 01:16:09 | m3 | n01 | a1 | o1 | Nap | TP | IN | 34937 |
2019-07-31 | 2019-08-01 01:16:10 | m3 | n01 | a2 | o2 | Nap | TP | OUT | 36673.09 |
2019-11-06 | 2019-11-18 20:21:06 | mg/l | n01 | a3 | o3 | NO3 | TP | OUT | 1 |
2019-11-06 | 2019-11-18 20:21:06 | mg/l | n01 | z5 | o4 | BOD | IO | IN | 220 |
2019-11-06 | 2019-11-18 20:21:06 | mg/l | n01 | z5 | o4 | BOD | TP | IN | 220 |
2019-11-06 | 2019-11-18 20:21:06 | mg/l | n01 | z6 | o1 | NO2 | TP | OUT | 0.31 |
2019-11-06 | 2019-11-18 20:21:13 | mg/l | n01 | a11 | o4 | Ntot | IO | IN | 47 |
2019-11-06 | 2019-11-18 20:21:13 | mg/l | n01 | a11 | o4 | Ntot | TP | IN | 47 |
2021-01-06 | 2021-01-07 02:15:06 | m3 | n01 | a1 | o1 | Nap | TP | IN | 17909 |
2021-01-06 | 2021-01-07 02:15:07 | m3 | n01 | a2 | o2 | Nap | TP | OUT | 19216.19 |
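I want to keep only the rows with the last (or max) Timestamp for each combination of Date and Condition.
The result table should not contain the duplicated Timestamps "2019-11-18 20:21:06" and "2019-11-18 20:21:13" (where the Condition and Result values are [z5, a11] and [220, 47], respectively).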
Date | Timestamp | Units | Name | Condition | Obj | Param | Attrib1 | Atrrib2 | Result |
---|---|---|---|---|---|---|---|---|---|
2019-07-31 | 2019-08-01 01:16:09 | m3 | n01 | a1 | o1 | Nap | TP | IN | 34937 |
2019-07-31 | 2019-08-01 01:16:10 | m3 | n01 | a2 | o2 | Nap | TP | OUT | 36673.09 |
2019-11-06 | 2019-11-18 20:21:06 | mg/l | n01 | a3 | o3 | NO3 | TP | OUT | 1 |
2019-11-06 | 2019-11-18 20:21:06 | mg/l | n01 | z5 | o4 | BOD | IO | IN | 220 |
2019-11-06 | 2019-11-18 20:21:06 | mg/l | n01 | z6 | o1 | NO2 | TP | OUT | 0.31 |
2019-11-06 | 2019-11-18 20:21:13 | mg/l | n01 | a11 | o4 | Ntot | IO | IN | 47 |
2021-01-06 | 2021-01-07 02:15:06 | m3 | n01 | a1 | o1 | Nap | TP | IN | 17909 |
2021-01-06 | 2021-01-07 02:15:07 | m3 | n01 | a2 | o2 | Nap | TP | OUT | 19216.19 |
library(tidyverse)
# Group per Date and Condition and filter max Timestamp
df <- read.csv("./Example.csv") %>%
  mutate(Date = as.POSIXct(Date, format = "%Y-%m-%d")) %>%
  mutate(Timestamp = as.POSIXct(Timestamp, format = "%Y-%m-%d %H:%M:%S")) %>%
  group_by(Date, Condition) %>%
  filter(Timestamp == max(Timestamp)) %>%
  distinct()
write_csv(df, file = "./ExampleResult.csv")
But I can't get the expected result.
What is wrong with this approach? Is there a simpler way?
Thanks!
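You have multiple values at max(Timestamp): filter(Timestamp == max(Timestamp)) keeps every row in a group that ties for the maximum, and distinct() does not drop them because the tied rows differ in Attrib1. To solve this, I suggest using dplyr::slice_max and setting with_ties = FALSE.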
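To see the ties directly, here is a minimal sketch on a three-row subset of the question's data. The subset, the `tz = "UTC"` setting, and the column name `rows_kept` are my own choices for illustration. It counts how many rows survive the `filter()` plus `distinct()` combination in each group.

```r
library(dplyr)

# Three rows taken from the question's data; the two z5 rows share a Timestamp
df <- data.frame(
  Date      = c("2019-11-06", "2019-11-06", "2019-11-06"),
  Timestamp = c("2019-11-18 20:21:06", "2019-11-18 20:21:06", "2019-11-18 20:21:06"),
  Condition = c("a3", "z5", "z5"),
  Attrib1   = c("TP", "IO", "TP"),
  Result    = c(1, 220, 220)
)

ties <- df %>%
  # tz = "UTC" is an assumption made here so the result does not depend on the local time zone
  mutate(Timestamp = as.POSIXct(Timestamp, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")) %>%
  group_by(Date, Condition) %>%
  filter(Timestamp == max(Timestamp)) %>%  # keeps every row tied for the group maximum
  distinct() %>%                           # removes nothing: the tied rows differ in Attrib1
  summarise(rows_kept = n(), .groups = "drop")

ties
```

The z5 group still has two rows after both steps, which is why the duplicates remain in the output.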
Here is some code that gets you what you want.
df %>%
  mutate(Date = as.POSIXct(Date, format = "%Y-%m-%d")) %>%
  mutate(Timestamp = as.POSIXct(Timestamp, format = "%Y-%m-%d %H:%M:%S")) %>%
  group_by(Date, Condition) %>%
  slice_max(order_by = Timestamp, n = 1, with_ties = FALSE)
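However, depending on your application, you may want to state explicitly how these ties are resolved by supplying additional variables to the order_by argument.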
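One way to make the tie-break explicit, sketched here with `arrange()` and `slice_head()` rather than `slice_max()`, is to sort by the timestamp and then by a second column before taking the first row per group. Using Attrib1 as the tie-breaker is only an assumption for illustration, and the subset below deliberately lists the z5 rows in the opposite order to the question's data to show that file order no longer matters.

```r
library(dplyr)

# Subset of the question's data with the two z5 rows swapped (TP before IO) for illustration
df <- data.frame(
  Date      = c("2019-11-06", "2019-11-06", "2019-11-06"),
  Timestamp = c("2019-11-18 20:21:06", "2019-11-18 20:21:06", "2019-11-18 20:21:06"),
  Condition = c("a3", "z5", "z5"),
  Attrib1   = c("TP", "TP", "IO"),
  Result    = c(1, 220, 220)
)

result <- df %>%
  # tz = "UTC" is an assumption made here to keep the example independent of the local time zone
  mutate(Timestamp = as.POSIXct(Timestamp, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")) %>%
  arrange(desc(Timestamp), Attrib1) %>%  # latest Timestamp first, ties broken by Attrib1 (assumed rule)
  group_by(Date, Condition) %>%
  slice_head(n = 1) %>%                  # first row per group in the sorted order
  ungroup()

result
```

With this rule the z5 group keeps the IO row regardless of the order in which the rows appear in the input.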
Try the following:
library(dplyr)
read.csv("./Example.csv") %>%
  #df %>%
  mutate(Date = as.Date(Date),
         Timestamp = as.POSIXct(Timestamp, format = "%Y-%m-%d %H:%M:%S")) %>%
  distinct(Date, Condition, Result, .keep_all = TRUE) -> result
result
#        Date           Timestamp Units Name Condition Obj Param Attrib1 Atrrib2   Result
#1 2019-07-31 2019-08-01 01:16:09    m3  n01        a1  o1   Nap      TP      IN 34937.00
#2 2019-07-31 2019-08-01 01:16:10    m3  n01        a2  o2   Nap      TP     OUT 36673.09
#3 2019-11-06 2019-11-18 20:21:06  mg/l  n01        a3  o3   NO3      TP     OUT     1.00
#4 2019-11-06 2019-11-18 20:21:06  mg/l  n01        z5  o4   BOD      IO      IN   220.00
#5 2019-11-06 2019-11-18 20:21:06  mg/l  n01        z6  o1   NO2      TP     OUT     0.31
#6 2019-11-06 2019-11-18 20:21:13  mg/l  n01       a11  o4  Ntot      IO      IN    47.00
#7 2021-01-06 2021-01-07 02:15:06    m3  n01        a1  o1   Nap      TP      IN 17909.00
#8 2021-01-06 2021-01-07 02:15:07    m3  n01        a2  o2   Nap      TP     OUT 19216.19
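Note that `distinct(..., .keep_all = TRUE)` keeps the first row it encounters for each combination and never looks at Timestamp. It matches the expected output here because the duplicated rows share the same timestamp. The sketch below uses hypothetical data (not from the question) in which the later timestamp appears second, to show where the two approaches would differ; `tz = "UTC"` is again an assumption.

```r
library(dplyr)

# Hypothetical rows: same Date/Condition/Result, the later Timestamp comes second
df <- data.frame(
  Date      = c("2019-11-06", "2019-11-06"),
  Timestamp = c("2019-11-18 20:21:06", "2019-11-18 20:21:13"),
  Condition = c("z5", "z5"),
  Result    = c(220, 220)
) %>%
  mutate(Timestamp = as.POSIXct(Timestamp, format = "%Y-%m-%d %H:%M:%S", tz = "UTC"))

# distinct() keeps the first occurrence, whatever its Timestamp
first_seen <- df %>%
  distinct(Date, Condition, Result, .keep_all = TRUE)

# slice_max() keeps the row with the latest Timestamp in each group
latest <- df %>%
  group_by(Date, Condition) %>%
  slice_max(order_by = Timestamp, n = 1, with_ties = FALSE) %>%
  ungroup()

first_seen$Timestamp
latest$Timestamp
```

If the input is not guaranteed to have identical timestamps within each duplicated group, the `slice_max()` form is the one that actually selects the maximum.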
Data
df <- structure(list(Date = c("2019-07-31", "2019-07-31", "2019-11-06",
"2019-11-06", "2019-11-06", "2019-11-06", "2019-11-06", "2019-11-06",
"2021-01-06", "2021-01-06"), Timestamp = c("2019-08-01 01:16:09",
"2019-08-01 01:16:10", "2019-11-18 20:21:06", "2019-11-18 20:21:06",
"2019-11-18 20:21:06", "2019-11-18 20:21:06", "2019-11-18 20:21:13",
"2019-11-18 20:21:13", "2021-01-07 02:15:06", "2021-01-07 02:15:07"
), Units = c("m3", "m3", "mg/l", "mg/l", "mg/l", "mg/l", "mg/l",
"mg/l", "m3", "m3"), Name = c("n01", "n01", "n01", "n01", "n01",
"n01", "n01", "n01", "n01", "n01"), Condition = c("a1", "a2",
"a3", "z5", "z5", "z6", "a11", "a11", "a1", "a2"), Obj = c("o1",
"o2", "o3", "o4", "o4", "o1", "o4", "o4", "o1", "o2"), Param = c("Nap",
"Nap", "NO3", "BOD", "BOD", "NO2", "Ntot", "Ntot", "Nap", "Nap"
), Attrib1 = c("TP", "TP", "TP", "IO", "TP", "TP", "IO", "TP",
"TP", "TP"), Atrrib2 = c("IN", "OUT", "OUT", "IN", "IN", "OUT",
"IN", "IN", "IN", "OUT"), Result = c(34937, 36673.09, 1, 220,
220, 0.31, 47, 47, 17909, 19216.19)),class = "data.frame",row.names = c(NA,-10L))