Replacing NA value in dataframe by first or last value of other columns within group
I have the following dataframe:
# Group labels must be quoted strings
Group <- c("A", "A", "A", "B", "B", "B")
Dates <- c("01-01-2000", "02-01-2000", "03-01-2000", "01-05-2020", "02-05-2020", "03-05-2020")
Departure <- c("01-01-2000", "01-01-2000", "01-01-2000", NA, NA, NA)
Arrival <- c(NA, NA, NA, "03-02-2020", "03-02-2020", "03-02-2020")
# Include Group so the data can be grouped later
Dates <- data.frame(Group, Dates, Departure, Arrival)
Dates
Group      Dates  Departure    Arrival
    1 01-01-2000 02-01-2000       <NA>
    1 02-01-2000 02-01-2000       <NA>
    1 03-01-2000 02-01-2000       <NA>
    2 01-05-2000       <NA> 31-12-2020
    2 02-05-2000       <NA> 31-12-2020
    2 03-05-2000       <NA> 31-12-2020
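Here is what I want to do: within each group, replace the NA values in Departure with the first value of Dates, and the NA values in Arrival with the last value of Dates.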
I would then obtain the following dataframe:
Group      Dates  Departure    Arrival
    1 01-01-2000 02-01-2000 03-01-2000
    1 02-01-2000 02-01-2000 03-01-2000
    1 03-01-2000 02-01-2000 03-01-2000
    2 01-05-2000 01-05-2000 31-12-2020
    2 02-05-2000 01-05-2000 31-12-2020
    2 03-05-2000 01-05-2000 31-12-2020
I'm thinking of using a combination of if else and group_by from dplyr, but beyond that I'm stuck. Any suggestions would be appreciated!
An option is to use replace_na (from tidyr) after grouping by 'Group' to replace the NA elements with either the first or the last value of the 'Dates' column.
library(dplyr)
library(tidyr)
df1 %>%
  group_by(Group) %>%
  # fill NA with the first/last 'Dates' value of each group
  mutate(Departure = replace_na(Departure, first(Dates)),
         Arrival = replace_na(Arrival, last(Dates))) %>%
  ungroup()
NOTE: Here we assume that 'Dates' are already ordered. If not, take the min and max after converting to Date class.
library(lubridate)
df1 %>%
  # parse every column except 'Group' as day-month-year dates
  mutate(across(-Group, dmy)) %>%
  group_by(Group) %>%
  mutate(Departure = replace_na(Departure, min(Dates)),
         Arrival = replace_na(Arrival, max(Dates))) %>%
  ungroup()
A data.table option
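library(data.table)
# 'Dates' is the data.frame from the question; setDT() converts it by reference
setDT(Dates)[
  ,
  .(
    Dates = Dates,
    Departure = replace(Departure, is.na(Departure), min(Dates)),
    Arrival = replace(Arrival, is.na(Arrival), max(Dates))
  ),
  Group
]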
gives
   Group      Dates  Departure    Arrival
1:     A 01-01-2000 01-01-2000 03-01-2000
2:     A 02-01-2000 01-01-2000 03-01-2000
3:     A 03-01-2000 01-01-2000 03-01-2000
4:     B 01-05-2020 01-05-2020 03-02-2020
5:     B 02-05-2020 01-05-2020 03-02-2020
6:     B 03-05-2020 01-05-2020 03-02-2020
The OP has asked to replace NA values in a data.frame.
One of data.table's strong points is the ability to update by reference, i.e., to replace values without copying the whole dataset.
In addition, data.table's fcoalesce() function is used together with Map().
library(data.table)
cols <- c("Departure", "Arrival")
setDT(df_Dates)[, (cols) := Map(fcoalesce, .SD, Dates[c(1L, .N)]), .SDcols = cols, by = Group]
df_Dates
   Group      Dates  Departure    Arrival
1:     A 01-01-2000 01-01-2000 03-01-2000
2:     A 02-01-2000 01-01-2000 03-01-2000
3:     A 03-01-2000 01-01-2000 03-01-2000
4:     B 01-05-2020 01-05-2020 03-02-2020
5:     B 02-05-2020 01-05-2020 03-02-2020
6:     B 03-05-2020 01-05-2020 03-02-2020
When calling fcoalesce(), Map() picks the first value of Dates in each group, Dates[1L], for the first column Departure and the last value, Dates[.N], for the second column Arrival.
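To make the pairing concrete, here is a minimal base R sketch of how Map() walks two inputs in parallel. The values are made up for illustration, and fill_na() is a hypothetical stand-in for fcoalesce() so that the snippet runs without data.table.

```r
# Two columns with gaps (made-up values) and one fill value per column
cols  <- list(Departure = c("01-01-2000", NA), Arrival = c(NA, "03-02-2020"))
fills <- c("first", "last")

# fill_na() is a hypothetical helper standing in for fcoalesce():
# it replaces every NA in x with the single value 'fill'
fill_na <- function(x, fill) replace(x, is.na(x), fill)

# Map() calls fill_na(cols[[1]], fills[1]), then fill_na(cols[[2]], fills[2])
filled <- Map(fill_na, cols, fills)

filled$Departure  # "01-01-2000" "first"
filled$Arrival    # "last" "03-02-2020"
```

The result keeps the names of the first argument, which is why it can be assigned back to the columns listed in cols.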
Please note that the original dataset has been changed in place, which can be verified by calling address() before and after.
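A rough sketch of that check, assuming a small made-up data.table named dt (not the OP's data) and a single column for brevity:

```r
library(data.table)

# Made-up example data; 'dt' is an assumption for this illustration
dt <- data.table(
  Group = c("A", "A", "B", "B"),
  Dates = c("01-01-2000", "02-01-2000", "01-05-2020", "02-05-2020"),
  Departure = c("01-01-2000", "01-01-2000", NA, NA)
)

addr_before <- address(dt)

# Update by reference: fill NA in Departure with the first Dates value per group
dt[, Departure := fcoalesce(Departure, Dates[1L]), by = Group]

addr_after <- address(dt)

# If the update happened in place, both addresses should be the same
same_object <- identical(addr_before, addr_after)
same_object
```

The check starts from an object that is already a data.table, so it only covers the := update and says nothing about what setDT() does when converting a data.frame.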
Using min(Dates) and max(Dates) instead of first(Dates) and last(Dates), or Dates[1L] and Dates[.N], respectively, may lead to unexpected results with other datasets, because Dates are given as character dates in the format DD-MM-YYYY, which are sorted as strings, i.e., on day of month first.
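The pitfall can be reproduced with a few made-up dates in base R. This is only an illustrative sketch; dates_chr is not part of the OP's data.

```r
# Character dates in DD-MM-YYYY format (made-up values for illustration)
dates_chr <- c("15-01-2020", "02-03-2020", "28-02-2020")

# Compared as strings, the day-of-month digits decide the order
chr_min <- min(dates_chr)  # "02-03-2020", although 15 January is earlier
chr_max <- max(dates_chr)  # "28-02-2020", although 2 March is later

# Converted to Date class, min() and max() follow the calendar
dates <- as.Date(dates_chr, format = "%d-%m-%Y")
date_min <- format(min(dates), "%d-%m-%Y")  # "15-01-2020"
date_max <- format(max(dates), "%d-%m-%Y")  # "02-03-2020"
```

With first()/last() or Dates[1L]/Dates[.N] the result depends only on row order, so those are safe as long as the rows are already sorted chronologically within each group.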
df_Dates <- data.frame(
Group = c("A", "A", "A", "B", "B", "B"),
Dates = c("01-01-2000", "02-01-2000", "03-01-2000", "01-05-2020", "02-05-2020", "03-05-2020"),
Departure = c("01-01-2000", "01-01-2000", "01-01-2000", NA, NA, NA),
Arrival = c(NA, NA, NA, "03-02-2020", "03-02-2020", "03-02-2020"))