Replace NA values in a dataframe with the first or last value of another column within each group

Question

I have the following dataframe:

Group <- c("A", "A", "A", "B", "B", "B")
Dates <- c("01-01-2000", "02-01-2000", "03-01-2000", "01-05-2020", "02-05-2020", "03-05-2020")
Departure <- c("01-01-2000", "01-01-2000", "01-01-2000", NA, NA, NA)
Arrival <- c(NA, NA, NA, "03-02-2020", "03-02-2020", "03-02-2020")
# Include Group so the grouping column is part of the data frame
Dates <- data.frame(Group, Dates, Departure, Arrival)
Dates

  Group      Dates  Departure    Arrival
1     A 01-01-2000 01-01-2000       <NA>
2     A 02-01-2000 01-01-2000       <NA>
3     A 03-01-2000 01-01-2000       <NA>
4     B 01-05-2020       <NA> 03-02-2020
5     B 02-05-2020       <NA> 03-02-2020
6     B 03-05-2020       <NA> 03-02-2020

Here is what I want to do:

For the "Departure" column: if the value is not NA, leave it as is. If the value is NA, replace it with the FIRST value of the "Dates" column within each group.
For the "Arrival" column: if the value is not NA, leave it as is. If the value is NA, replace it with the LAST value of the "Dates" column within each group.

I would then obtain the following dataframe:

  Group      Dates  Departure    Arrival
1     A 01-01-2000 01-01-2000 03-01-2000
2     A 02-01-2000 01-01-2000 03-01-2000
3     A 03-01-2000 01-01-2000 03-01-2000
4     B 01-05-2020 01-05-2020 03-02-2020
5     B 02-05-2020 01-05-2020 03-02-2020
6     B 03-05-2020 01-05-2020 03-02-2020

I'm thinking of using a combination of if else and group_by from dplyr, but beyond that I'm stuck. Any suggestions would be appreciated!

Answer 1

An option is to use replace_na (from tidyr) after grouping by 'Group' to replace the NA elements with either the first or last value of the 'Dates' column.

library(dplyr)
library(tidyr)
df1 %>% 
   group_by(Group) %>% 
   mutate(Departure = replace_na(Departure, first(Dates)), 
          Arrival = replace_na(Arrival, last(Dates))) %>% 
   ungroup
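
For a self-contained run, the pipeline above can be applied to the question's data. This is a minimal sketch: the name df1 comes from the answer, but its construction here is an assumption (it mirrors the question's vectors), and stringsAsFactors = FALSE is added so the date columns stay character on older R versions.

```r
library(dplyr)
library(tidyr)

# Assumed input: same columns and values as the question's data frame
df1 <- data.frame(
  Group = c("A", "A", "A", "B", "B", "B"),
  Dates = c("01-01-2000", "02-01-2000", "03-01-2000",
            "01-05-2020", "02-05-2020", "03-05-2020"),
  Departure = c("01-01-2000", "01-01-2000", "01-01-2000", NA, NA, NA),
  Arrival = c(NA, NA, NA, "03-02-2020", "03-02-2020", "03-02-2020"),
  stringsAsFactors = FALSE
)

result <- df1 %>%
  group_by(Group) %>%
  # Fill NA with the first / last 'Dates' value of the current group
  mutate(Departure = replace_na(Departure, first(Dates)),
         Arrival = replace_na(Arrival, last(Dates))) %>%
  ungroup()

result
```

With this input, group A keeps its existing Departure values and receives the group's last date as Arrival, while group B receives its first date as Departure and keeps its Arrival values.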

NOTE: Here we assume that 'Dates' are already ordered. If not, take the min and max after converting to Date class.

library(lubridate)
df1 %>% 
   mutate(across(-Group, dmy)) %>%
   group_by(Group) %>% 
   mutate(Departure = replace_na(Departure, min(Dates)), 
          Arrival = replace_na(Arrival, max(Dates))) %>% 
   ungroup

Answer 2

A data.table option

library(data.table)
setDT(Dates)[
  ,
  .(
    Dates = Dates,
    Departure = replace(Departure, is.na(Departure), min(Dates)),
    Arrival = replace(Arrival, is.na(Arrival), max(Dates))
  ),
  Group
]

gives

   Group      Dates  Departure    Arrival
1:     A 01-01-2000 01-01-2000 03-01-2000
2:     A 02-01-2000 01-01-2000 03-01-2000
3:     A 03-01-2000 01-01-2000 03-01-2000
4:     B 01-05-2020 01-05-2020 03-02-2020
5:     B 02-05-2020 01-05-2020 03-02-2020
6:     B 03-05-2020 01-05-2020 03-02-2020

Answer 3

The OP has asked to replace NA values in a data.frame.

One of data.table's strong points is the ability to update by reference, i.e., to replace values without copying the whole dataset.

In addition, data.table's fcoalesce() function is used together with Map().

library(data.table)
cols <- c("Departure", "Arrival")
setDT(df_Dates)[, (cols) := Map(fcoalesce, .SD, Dates[c(1L, .N)]), .SDcols = cols, by = Group]
df_Dates

   Group      Dates  Departure    Arrival
1:     A 01-01-2000 01-01-2000 03-01-2000
2:     A 02-01-2000 01-01-2000 03-01-2000
3:     A 03-01-2000 01-01-2000 03-01-2000
4:     B 01-05-2020 01-05-2020 03-02-2020
5:     B 02-05-2020 01-05-2020 03-02-2020
6:     B 03-05-2020 01-05-2020 03-02-2020

When calling fcoalesce(), Map() picks the first value of Dates in each group for the first column, Departure, and the last value, Dates[.N], for the second column, Arrival.
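
The pairing that Map() performs can be illustrated on a single group using base R only. This is a sketch under stated assumptions: coalesce_chr is a hypothetical stand-in for fcoalesce() (it is not part of data.table), and sd_cols plays the role of .SD for group B.

```r
# Hypothetical base R stand-in for fcoalesce(): keep x where present, else use fill
coalesce_chr <- function(x, fill) ifelse(is.na(x), fill, x)

# One group's worth of data (group B from the question)
dates <- c("01-05-2020", "02-05-2020", "03-05-2020")
sd_cols <- list(
  Departure = c(NA_character_, NA_character_, NA_character_),
  Arrival   = c("03-02-2020", "03-02-2020", "03-02-2020")
)

# Map() walks both arguments in parallel:
#   Departure is paired with the first date, Arrival with the last date
filled <- Map(coalesce_chr, sd_cols, dates[c(1L, length(dates))])

filled$Departure  # all NA, so every element becomes the first date
filled$Arrival    # no NA, so the original values are kept
```

Because the fill values are supplied as a two-element vector, the order of cols has to match the order of Dates[c(1L, .N)]: first date for the first column, last date for the second.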

Please note that the original dataset has been changed in place, which can be verified by calling address() before and after.
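
One way to run that check is sketched below. It assumes the data are already a data.table (the name dt is illustrative) and compares addresses around the := call only, since converting a data.frame with setDT() is a separate step that is not examined here.

```r
library(data.table)

# Assumed input: the question's data, created directly as a data.table
dt <- data.table(
  Group = c("A", "A", "A", "B", "B", "B"),
  Dates = c("01-01-2000", "02-01-2000", "03-01-2000",
            "01-05-2020", "02-05-2020", "03-05-2020"),
  Departure = c("01-01-2000", "01-01-2000", "01-01-2000", NA, NA, NA),
  Arrival = c(NA, NA, NA, "03-02-2020", "03-02-2020", "03-02-2020")
)
cols <- c("Departure", "Arrival")

addr_before <- address(dt)
# Same update as in the answer: fill NA per group, assigning by reference
dt[, (cols) := Map(fcoalesce, .SD, Dates[c(1L, .N)]), .SDcols = cols, by = Group]
addr_after <- address(dt)

# Compare the two addresses to see whether dt was copied by the update
identical(addr_before, addr_after)
dt
```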

Using min(Dates) and max(Dates) instead of first(Dates) and last(Dates), or Dates[1L] and Dates[.N], respectively, may lead to unexpected results with other datasets, because Dates holds character strings in the format DD-MM-YYYY, which sort on the day of the month first.
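
A small base R illustration of this pitfall, using two made-up dates that are not part of the question's data:

```r
# Two dates in DD-MM-YYYY format: 2 January 2000 and 1 February 2000
x <- c("02-01-2000", "01-02-2000")

# Character comparison looks at the day digits first
min_as_text <- min(x)

# Converting to Date compares actual calendar dates
d <- as.Date(x, format = "%d-%m-%Y")
min_as_date <- format(min(d), "%d-%m-%Y")

min_as_text  # "01-02-2000" (1 February, not the earliest date)
min_as_date  # "02-01-2000" (2 January, the earliest date)
```

In the question's data the two approaches happen to agree, because within each group only the day changes and the rows are already in date order.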

Data

df_Dates <- data.frame(
  Group = c("A", "A", "A", "B", "B", "B"), 
  Dates = c("01-01-2000", "02-01-2000", "03-01-2000", "01-05-2020", "02-05-2020", "03-05-2020"), 
  Departure = c("01-01-2000", "01-01-2000", "01-01-2000", NA, NA, NA), 
  Arrival = c(NA, NA, NA, "03-02-2020", "03-02-2020", "03-02-2020"))
