Replacing NA value in dataframe by first or last value of other columns within group

I have the following dataframe:

Group<-c("A","A","A","B","B","B")
Dates<-c("01-01-2000","02-01-2000","03-01-2000","01-05-2020","02-05-2020","03-05-2020")
Departure<-c("01-01-2000","01-01-2000","01-01-2000",NA,NA,NA)
Arrival<-c(NA,NA,NA,"03-02-2020","03-02-2020","03-02-2020")
Dates<-data.frame(Group,Dates,Departure,Arrival)
Dates

 Group  Dates      Departure    Arrival
     1  01-01-2000 02-01-2000       <NA>
     1  02-01-2000 02-01-2000       <NA>
     1  03-01-2000 02-01-2000       <NA>
     2  01-05-2000       <NA> 31-12-2020
     2  02-05-2000       <NA> 31-12-2020
     2  03-05-2000       <NA> 31-12-2020

Here is what I want to do:

  • For the "Departure" column: if the value is NOT NA, leave as is. If the value is NA, then replace with the FIRST value of the "Dates" column within each group.
  • For the "Arrival" column: if the value is NOT NA, leave as is. If the value is NA, then replace with the LAST value of the "Dates" column within each group.

I would then obtain the following dataframe:

 Group  Dates      Departure    Arrival
     1  01-01-2000 02-01-2000   03-01-2000
     1  02-01-2000 02-01-2000   03-01-2000
     1  03-01-2000 02-01-2000   03-01-2000
     2  01-05-2000 01-05-2000   31-12-2020
     2  02-05-2000 01-05-2000   31-12-2020
     2  03-05-2000 01-05-2000   31-12-2020

I'm thinking of using a combination of if else and group_by from dplyr, but beyond that I'm stuck. Any suggestions would be appreciated!

An option is to use replace_na (from tidyr) after grouping by 'Group' to replace the NA elements with either the first or last value of the 'Dates' column:

library(dplyr)
library(tidyr)
df1 %>% 
   group_by(Group) %>% 
   mutate(Departure = replace_na(Departure, first(Dates)), 
          Arrival = replace_na(Arrival, last(Dates))) %>% 
   ungroup
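Since the question mentions if else, here is a minimal sketch of the same idea written with dplyr's if_else() instead of replace_na(). It assumes the sample data is stored as df1 (the name used in this answer; the data frame is rebuilt here so the snippet runs on its own) and that 'Dates' are already ordered within each group. The name res is only for illustration.

```r
library(dplyr)

# Sample data, rebuilt here so the snippet is self-contained
df1 <- data.frame(
  Group = c("A", "A", "A", "B", "B", "B"),
  Dates = c("01-01-2000", "02-01-2000", "03-01-2000",
            "01-05-2020", "02-05-2020", "03-05-2020"),
  Departure = c("01-01-2000", "01-01-2000", "01-01-2000", NA, NA, NA),
  Arrival = c(NA, NA, NA, "03-02-2020", "03-02-2020", "03-02-2020")
)

res <- df1 %>%
  group_by(Group) %>%
  mutate(
    # keep non-NA values, otherwise fall back to the group's first / last Dates
    Departure = if_else(is.na(Departure), first(Dates), Departure),
    Arrival   = if_else(is.na(Arrival), last(Dates), Arrival)
  ) %>%
  ungroup()

res
```

replace_na() is shorter, but if_else() makes the condition explicit, which may be easier to adapt if the rule is something other than a plain NA check.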

NOTE: Here we assume that 'Dates' are already ordered. If not, take the min and max after converting to Date class:

library(lubridate)
df1 %>% 
   mutate(across(-Group, dmy)) %>%
   group_by(Group) %>% 
   mutate(Departure = replace_na(Departure, min(Dates)), 
          Arrival = replace_na(Arrival, max(Dates))) %>% 
   ungroup

A data.table option

library(data.table)
setDT(Dates)[
  ,
  .(
    Dates = Dates,
    Departure = replace(Departure, is.na(Departure), min(Dates)),
    Arrival = replace(Arrival, is.na(Arrival), max(Dates))
  ),
  Group
]

gives

   Group      Dates  Departure    Arrival
1:     A 01-01-2000 01-01-2000 03-01-2000
2:     A 02-01-2000 01-01-2000 03-01-2000
3:     A 03-01-2000 01-01-2000 03-01-2000
4:     B 01-05-2020 01-05-2020 03-02-2020
5:     B 02-05-2020 01-05-2020 03-02-2020
6:     B 03-05-2020 01-05-2020 03-02-2020

The OP has asked to replace NA values in a data.frame.

One of data.table's strong points is the ability to update by reference, i.e., to replace values without copying the whole dataset.

In addition, data.table's fcoalesce() function is used together with Map().

library(data.table)
cols <- c("Departure", "Arrival")
setDT(df_Dates)[, (cols) := Map(fcoalesce, .SD, Dates[c(1L, .N)]), .SDcols = cols, by = Group]
df_Dates
   Group      Dates  Departure    Arrival
1:     A 01-01-2000 01-01-2000 03-01-2000
2:     A 02-01-2000 01-01-2000 03-01-2000
3:     A 03-01-2000 01-01-2000 03-01-2000
4:     B 01-05-2020 01-05-2020 03-02-2020
5:     B 02-05-2020 01-05-2020 03-02-2020
6:     B 03-05-2020 01-05-2020 03-02-2020

Map() picks the first value of Dates in each group, Dates[1L], for the first column Departure and the last value, Dates[.N], for the second column Arrival when calling fcoalesce().

Please note that the original dataset has been changed in place, which can be verified by calling address() before and after.
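The in-place claim can be checked with a short sketch like the one below. It uses data.table's address() and assumes the data frame has already been converted with setDT() before the first address is taken, so that only the := update is being compared. The names addr_before and addr_after are only for illustration.

```r
library(data.table)

df_Dates <- data.frame(
  Group = c("A", "A", "A", "B", "B", "B"),
  Dates = c("01-01-2000", "02-01-2000", "03-01-2000",
            "01-05-2020", "02-05-2020", "03-05-2020"),
  Departure = c("01-01-2000", "01-01-2000", "01-01-2000", NA, NA, NA),
  Arrival = c(NA, NA, NA, "03-02-2020", "03-02-2020", "03-02-2020")
)

# Convert first, so the comparison covers only the update by reference
setDT(df_Dates)
addr_before <- address(df_Dates)

cols <- c("Departure", "Arrival")
df_Dates[, (cols) := Map(fcoalesce, .SD, Dates[c(1L, .N)]),
         .SDcols = cols, by = Group]

addr_after <- address(df_Dates)

# The same address before and after means the object was not copied
identical(addr_before, addr_after)
```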

Using min(Dates) and max(Dates) instead of first(Dates) and last(Dates), or Dates[1L] and Dates[.N], resp., may lead to unexpected results with other datasets, because Dates are given as character dates in the format DD-MM-YYYY, which are sorted on day of month first.
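To illustrate the pitfall, here is a small base R sketch with made-up dates (they are not part of the OP's data). On character vectors, min() and max() compare the strings, so the day of month decides the order; after converting to Date class the comparison is chronological.

```r
# Hypothetical dates in DD-MM-YYYY format, for illustration only
dates_chr <- c("15-01-2000", "02-03-2000", "28-02-2000")

# Character comparison: the leading day of month decides
chr_min <- min(dates_chr)   # "02-03-2000"
chr_max <- max(dates_chr)   # "28-02-2000"

# Chronological comparison after converting to Date class
dates_dt <- as.Date(dates_chr, format = "%d-%m-%Y")
date_min <- format(min(dates_dt), "%d-%m-%Y")   # "15-01-2000"
date_max <- format(max(dates_dt), "%d-%m-%Y")   # "02-03-2000"
```

With the OP's sample data both approaches happen to agree, because all dates within a group fall in the same month and year.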

Data

df_Dates <- data.frame(
  Group = c("A", "A", "A", "B", "B", "B"), 
  Dates = c("01-01-2000", "02-01-2000", "03-01-2000", "01-05-2020", "02-05-2020", "03-05-2020"), 
  Departure = c("01-01-2000", "01-01-2000", "01-01-2000", NA, NA, NA), 
  Arrival = c(NA, NA, NA, "03-02-2020", "03-02-2020", "03-02-2020"))
