
How to fill NA values using two other columns

I want to fill in the NA values in my dataset, but I am not sure whether the following is possible.

I have 3 columns, and I want to fill in the NA values of distance:

duration    distance    mode
15          7           car
20          6           walk
13          NA          car
20          8           car
18          NA          walk
30          10          walk

For each NA, I want to find the row with the same mode whose duration is closest, and use that row's distance.

For the first NA (mode car), the closest duration to 13 is 15, so the distance becomes 7. For the second NA (mode walk), the closest duration to 18 is 20, so the distance becomes 6.

Here's a data.table solution:

library(data.table)

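# For rows with a missing distance, look up the known row of the same mode
# whose duration is closest, and copy its distance.
dt[is.na(distance),
   distance := {dt[!is.na(distance)        # candidate rows with a known distance
                   ][.SD,                  # join each NA row (.SD) to the candidates
                     on = .(mode),
                     distance[which.min(abs(duration - i.duration))],
                     by = .EACHI]$V1       # one value per NA row
     }
   ]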

dt

#   duration distance mode
#1:       15        7  car
#2:       20        6 walk
#3:       13        7  car
#4:       20        8  car
#5:       18        6 walk
#6:       30       10 walk
#7:       35       10 walk

It:

  1. Subsets the data.table to the rows where distance is NA.
  2. Self-joins those rows to the non-NA rows on the mode of transportation.
  3. For each NA row, picks the non-NA row with the smallest absolute difference in duration and takes its distance.
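
To make the three steps easier to follow, here is a rough sketch that runs them one at a time instead of inside a single update. The names trips, missing, known and filled are made up for illustration and are not part of the answer; trips is a fresh copy of the sample data so the sketch can run on its own.

```r
library(data.table)

# Fresh copy of the sample data (hypothetical name `trips`)
trips <- data.table(
  duration = c(15, 20, 13, 20, 18, 30, 35),
  distance = c(7, 6, NA, 8, NA, 10, NA),
  mode     = c("car", "walk", "car", "car", "walk", "walk", "walk")
)

# Step 1: rows whose distance is missing
missing <- trips[is.na(distance)]

# Step 2: rows whose distance is known (the candidates to copy from)
known <- trips[!is.na(distance)]

# Step 3: for each missing row, join to the candidates of the same mode and
# take the distance of the candidate with the closest duration
filled <- known[missing,
                on = .(mode),
                .(fill = distance[which.min(abs(duration - i.duration))]),
                by = .EACHI]
filled
```

Note that which.min() returns the first minimum, so if two candidates are equally close the earlier row wins.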

Data:

library(data.table)
dt <- fread('duration    distance       mode
15            7            car
20           6             walk
13            NA             car
20           8             car
18           NA            walk
30           10            walk
35            NA            walk')

One way in base R is to split the rows into an NA group and a non-NA group. For every row in NA_group, we find the row in non_NA_group with the same mode and the closest duration, and return its distance. Here df is the six-row data frame from the question.

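# Rows to fill, and rows that can supply a distance
NA_group <- subset(df, is.na(distance))
non_NA_group <- subset(df, !is.na(distance))

# x = duration of an NA row, y = its mode
df$distance[is.na(df$distance)] <- mapply(function(x, y) {
    temp <- subset(non_NA_group, mode == y)          # known rows of the same mode
    temp$distance[which.min(abs(x - temp$duration))] # distance at the closest duration
}, NA_group$duration, NA_group$mode)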

df
#  duration distance mode
#1       15        7  car
#2       20        6 walk
#3       13        7  car
#4       20        8  car
#5       18        6 walk
#6       30       10 walk
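
If some mode has no row with a known distance, the lookup above has nothing to pick from. As a rough sketch, the same idea can be wrapped in a small function that leaves such rows as NA. The name fill_nearest is hypothetical and not from the answer, and the data frame is rebuilt here from the question's table so the example is self-contained.

```r
# Hypothetical helper (not part of the original answer): fill each NA distance
# with the distance of the same-mode row whose duration is closest.
fill_nearest <- function(df) {
  known <- df[!is.na(df$distance), ]
  for (i in which(is.na(df$distance))) {
    candidates <- known[known$mode == df$mode[i], ]
    if (nrow(candidates) == 0) next  # no known row for this mode: leave NA
    nearest <- which.min(abs(candidates$duration - df$duration[i]))
    df$distance[i] <- candidates$distance[nearest]
  }
  df
}

# Sample data from the question
df <- data.frame(
  duration = c(15, 20, 13, 20, 18, 30),
  distance = c(7, 6, NA, 8, NA, 10),
  mode     = c("car", "walk", "car", "car", "walk", "walk"),
  stringsAsFactors = FALSE
)

fill_nearest(df)
```

A plain loop is used here instead of mapply so that a row without any candidate can simply be skipped.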
