
How to do a conditional NA fill in R dataframe

This may be simple, but I could not figure it out. How can I fill the NA values in the feature column of the data frame dt according to the conditions below?

The conditions to fill NA are:

  1. If the difference in Date is 1, fill the NA with the previous row's value (easily done with the fill function from tidyr, part of the tidyverse):
library(dplyr)  # provides %>%
library(tidyr)  # provides fill()
dt_fl <- dt %>%
  fill(feature, .direction = "down")
dt_fl
  2. If the difference in Date is > 1, fill the NA with the previous feature value + 1 and increment all following feature values by 1 so that the feature values stay continuous. dt_output shows what I expect from dt after filling the NA values and renumbering the features accordingly.
dt<-structure(list(Date = structure(c(15126, 15127, 15128, 15129, 
                    15130, 15131, 15132, 15133, 15134, 15138, 15139, 15140, 15141, 
                    15142, 15143, 15144, 15145, 15146, 15147, 15148, 15149), class = "Date"), 
                    feature = c(1, 1, 1, 1, 1, 1, 1, 1, NA, NA, NA, NA, NA, NA, 
                    2, 2, 2, 2, 2, 2, NA)), row.names = c(NA, -21L), class = c("tbl_df", 
                    "tbl", "data.frame"))
 dt

dt_output<-structure(list(Date = structure(c(15126, 15127, 15128, 15129, 
          15130, 15131, 15132, 15133, 15134, 15138, 15139, 15140, 15141, 
          15142, 15143, 15144, 15145, 15146, 15147, 15148, 15149), class = "Date"), 
          feature = c(1, 1, 1, 1, 1, 1, 1, 1, NA, NA, NA, NA, NA, NA, 
          2, 2, 2, 2, 2, 2, NA), finaloutput = c(1, 1, 1, 1, 1, 1, 
          1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3)), row.names = c(NA, 
           -21L), spec = structure(list(cols = list(Date = structure(list(), class = c("collector_character", 
          "collector")), feature = structure(list(), class = c("collector_double", 
            "collector")), finaloutput = structure(list(), class = c("collector_double", 
          "collector"))), default = structure(list(), class = c("collector_guess", 
          "collector")), skip = 1L), class = "col_spec"), class = c("spec_tbl_df", 
          "tbl_df", "tbl", "data.frame"))
dt_output

Also, following Ben's suggestion: if the data frame starts with NA in feature, as in dt2, how can this be handled? The expected output for dt2 is in dt2_output.

  dt2<-structure(list(Date = structure(c(13675, 13676, 13677, 13678, 
      13679, 13689, 13690, 13691, 13692, 13693, 13694, 13695), class = "Date"), 
    feature = c(NA, NA, NA, NA, NA, 1, 1, 1, 1, 1, NA, 2)), row.names = c(NA, 
    -12L), class = c("tbl_df", "tbl", "data.frame"))
dt2_output<-structure(list(Date = structure(c(13675, 13676, 13677, 13678, 
              13679, 13689, 13690, 13691, 13692, 13693, 13694, 13695), class = "Date"), 
              feature = c(NA, NA, NA, NA, NA, 1, 1, 1, 1, 1, NA, 2), output_feature = c(1, 
              1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3)), row.names = c(NA, -12L
              ), spec = structure(list(cols = list(Date = structure(list(), class = c("collector_character", 
              "collector")), feature = structure(list(), class = c("collector_double", 
              "collector")), output_feature = structure(list(), class = c("collector_double", 
              "collector"))), default = structure(list(), class = c("collector_guess", 
              "collector")), skip = 1L), class = "col_spec"), class = c("spec_tbl_df", 
              "tbl_df", "tbl", "data.frame"))

The solution Ben provides works for every case except one, shown in dt3 below, and I am wondering why. My assumption is that the second solution should produce dt3_expected for dt3.

dt3<-structure(list(Date = structure(c(10063, 10064, 10065, 10066, 
     10067, 10068, 10069, 10070, 10079, 10080, 10081, 10082, 10083, 
     10084, 10085, 10086, 10087, 10088, 10089), class = "Date"), feature = c(1, 
     1, 1, 1, 1, 1, 1, NA, NA, 2, 2, 2, 2, 2, 2, 2, 2, 2, NA)), row.names = c(NA, 
    -19L), class = c("tbl_df", "tbl", "data.frame"))

dt3

dt3_expected<-structure(list(Date = structure(c(10063, 10064, 10065, 10066, 
10067, 10068, 10069, 10070, 10079, 10080, 10081, 10082, 10083, 
10084, 10085, 10086, 10087, 10088, 10089), class = "Date"), feature = c(1, 
1, 1, 1, 1, 1, 1, NA, NA, 2, 2, 2, 2, 2, 2, 2, 2, 2, NA), output_feature = c(1, 
 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2)), row.names = c(NA, 
-19L), spec = structure(list(cols = list(Date = structure(list(), class = c("collector_character", 
 "collector")), feature = structure(list(), class = c("collector_double", 
"collector")), output_feature = structure(list(), class = c("collector_double", 
  "collector"))), default = structure(list(), class = c("collector_guess", 
  "collector")), skip = 1L), class = "col_spec"), class = c("spec_tbl_df", 
  "tbl_df", "tbl", "data.frame"))

Any help is greatly appreciated, thank you.

You could create an "offset" that is incremented whenever a row has a missing feature and the gap from the previous Date is greater than 1 day. Adding this cumulative offset to the filled feature value gives finaloutput.

dt %>%
  mutate(offset = cumsum(is.na(feature) & Date - lag(Date) > 1)) %>%
  fill(feature, .direction = "down") %>%
  mutate(finaloutput = feature + offset)

Output

# A tibble: 21 x 4
   Date       feature offset finaloutput
   <date>       <dbl>  <int>       <dbl>
 1 2011-06-01       1      0           1
 2 2011-06-02       1      0           1
 3 2011-06-03       1      0           1
 4 2011-06-04       1      0           1
 5 2011-06-05       1      0           1
 6 2011-06-06       1      0           1
 7 2011-06-07       1      0           1
 8 2011-06-08       1      0           1
 9 2011-06-09       1      0           1
10 2011-06-13       1      1           2
11 2011-06-14       1      1           2
12 2011-06-15       1      1           2
13 2011-06-16       1      1           2
14 2011-06-17       1      1           2
15 2011-06-18       2      1           3
16 2011-06-19       2      1           3
17 2011-06-20       2      1           3
18 2011-06-21       2      1           3
19 2011-06-22       2      1           3
20 2011-06-23       2      1           3
21 2011-06-24       2      1           3
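
To see how the offset is built up, here is a minimal sketch on a made-up four-row table (the toy data and the intermediate columns gap_days and is_gap_na are illustrative and not part of the original answer). It splits the one-line condition into its parts so each step can be inspected.

```r
library(dplyr)
library(tidyr)

# Toy data (made up for illustration): one 1-day step, then a 4-day gap
toy <- tibble(
  Date    = as.Date("2011-06-01") + c(0, 1, 5, 6),
  feature = c(1, NA, NA, 2)
)

toy_out <- toy %>%
  mutate(
    gap_days  = as.numeric(Date - lag(Date)),   # NA for the first row
    is_gap_na = is.na(feature) & gap_days > 1,  # TRUE only for an NA right after a gap
    offset    = cumsum(is_gap_na)               # running count of such rows
  ) %>%
  fill(feature, .direction = "down") %>%        # carry the last known feature forward
  mutate(finaloutput = feature + offset)

toy_out
```

The second row is an NA one day after the first, so it simply inherits feature 1. The third row is an NA after a gap, so the offset becomes 1 and every later row is shifted up by one.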

Edit: With the second example, dt2, which begins with NA, you can try the following.

First, you can add a default for lag. When the first row is NA, it is evaluated for a difference in Date. Since there is no prior Date to compare with, you can use a default that is more than 1 day earlier, so that an offset is added and these initial NA values are treated as the "first" feature.

The second issue is filling in the NA values when you cannot fill in the down direction (there is no prior feature value when the data starts with NA). You can just replace these with 0. Given the offset, this becomes a finaloutput of 0 + 1 = 1.

dt2 %>%
  mutate(offset = cumsum(is.na(feature) & Date - lag(Date, default = first(Date) - 2) > 1)) %>%
  fill(feature, .direction = "down") %>%
  replace_na(list(feature = 0)) %>%
  mutate(finaloutput = feature + offset)

Output

   Date       feature offset finaloutput
   <date>       <dbl>  <int>       <dbl>
 1 2007-06-11       0      1           1
 2 2007-06-12       0      1           1
 3 2007-06-13       0      1           1
 4 2007-06-14       0      1           1
 5 2007-06-15       0      1           1
 6 2007-06-25       1      1           2
 7 2007-06-26       1      1           2
 8 2007-06-27       1      1           2
 9 2007-06-28       1      1           2
10 2007-06-29       1      1           2
11 2007-06-30       1      1           2
12 2007-07-01       2      1           3
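
The reason the default matters can be checked in isolation. The sketch below uses a made-up three-element Date vector (the names d, gap_plain and gap_default are illustrative) to compare lag with and without a default, and shows that a single NA at the start of a logical vector makes cumsum return NA throughout.

```r
library(dplyr)

# Made-up dates: a 1-day step followed by a 10-day gap
d <- as.Date("2007-06-11") + c(0, 1, 11)

# Without a default, the first difference is NA (nothing to compare with)
gap_plain <- as.numeric(d - lag(d))

# With a default two days before the first date, the first difference is 2
gap_default <- as.numeric(d - lag(d, default = first(d) - 2))

# If the first row is also NA in feature, TRUE & NA is NA,
# and cumsum() propagates that NA to every later element
na_offset <- cumsum(c(TRUE & NA, FALSE, TRUE))

gap_plain
gap_default
na_offset
```

With the default in place the first comparison evaluates to TRUE rather than NA, so the offset starts at 1 for a leading run of NA values.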

Edit: Following an additional comment, there is one more criterion to consider.

If the difference in Date is > 1 and there are only 2 NA values, the first NA should be filled with the previous feature and the second with the following feature. In other words, the second of 2 NA values that span a gap has to be handled differently.

One approach is to count the number of consecutive NA values in each run. Then feature can be filled in for this particular case, where the second of two NA values coincides with a Date gap.

dt3 %>%
  mutate(grp = cumsum(c(1, abs(diff(is.na(feature))) == 1))) %>%
  add_count(grp) %>%
  ungroup %>%
  mutate(feature = ifelse(is.na(feature) & n == 2 & is.na(lag(feature)), lead(feature), feature)) %>%
  mutate(offset = cumsum(is.na(feature) & Date - lag(Date, default = first(Date) - 2) > 1)) %>%
  fill(feature, .direction = "down") %>%
  replace_na(list(feature = 0)) %>%
  mutate(finaloutput = feature + offset)

Output

   Date       feature   grp     n offset finaloutput
   <date>       <dbl> <dbl> <int>  <int>       <dbl>
 1 1997-07-21       1     1     7      0           1
 2 1997-07-22       1     1     7      0           1
 3 1997-07-23       1     1     7      0           1
 4 1997-07-24       1     1     7      0           1
 5 1997-07-25       1     1     7      0           1
 6 1997-07-26       1     1     7      0           1
 7 1997-07-27       1     1     7      0           1
 8 1997-07-28       1     2     2      0           1
 9 1997-08-06       2     2     2      0           2
10 1997-08-07       2     3     9      0           2
11 1997-08-08       2     3     9      0           2
12 1997-08-09       2     3     9      0           2
13 1997-08-10       2     3     9      0           2
14 1997-08-11       2     3     9      0           2
15 1997-08-12       2     3     9      0           2
16 1997-08-13       2     3     9      0           2
17 1997-08-14       2     3     9      0           2
18 1997-08-15       2     3     9      0           2
19 1997-08-16       2     4     1      0           2
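
The run counting used for grp and n can be looked at on its own. This is a base R sketch on a made-up vector (the names na_flag, run_len and rle_len are illustrative); it builds the same run identifier as the pipeline above and compares the resulting run lengths with what rle reports.

```r
# Made-up feature vector: a run of two NA values and a single trailing NA
feature <- c(1, 1, NA, NA, 2, 2, 2, NA)
na_flag <- is.na(feature)

# A new group starts wherever the NA status flips between neighbouring rows
grp <- cumsum(c(1, abs(diff(na_flag)) == 1))

# Length of the run each row belongs to (what add_count(grp) stores in n)
run_len <- ave(seq_along(grp), grp, FUN = length)

# Same information from run-length encoding, expanded back to one value per row
runs <- rle(na_flag)
rle_len <- rep(runs$lengths, runs$lengths)

grp
run_len
all(run_len == rle_len)
```

Rows where na_flag is TRUE and run_len is 2 are the candidates for the special two-NA rule.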

Note that this could be simplified, but before doing so I would need to be sure it meets your needs.
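
As a rough check, the last pipeline can be wrapped in a function and run against all three examples. In this sketch the function name fill_feature and the columns run_len and filled are hypothetical names chosen for illustration, and the three data frames are rebuilt compactly from the same dates and feature values as dt, dt2 and dt3 in the question.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper (name is illustrative): the final pipeline as a function
fill_feature <- function(df) {
  df %>%
    mutate(grp = cumsum(c(1, abs(diff(is.na(feature))) == 1))) %>%
    add_count(grp, name = "run_len") %>%
    mutate(
      # second NA of a two-NA run takes the following feature value
      filled = ifelse(is.na(feature) & run_len == 2 & is.na(lag(feature)),
                      lead(feature), feature),
      # count remaining NA rows that directly follow a date gap
      offset = cumsum(is.na(filled) &
                        as.numeric(Date - lag(Date, default = first(Date) - 2)) > 1)
    ) %>%
    fill(filled, .direction = "down") %>%
    replace_na(list(filled = 0)) %>%
    mutate(finaloutput = filled + offset) %>%
    select(Date, feature, finaloutput)
}

# Same values as dt, dt2 and dt3 in the question, written compactly
dt  <- tibble(Date = as.Date("2011-06-01") + c(0:8, 12:23),
              feature = c(rep(1, 8), rep(NA, 6), rep(2, 6), NA))
dt2 <- tibble(Date = as.Date("2007-06-11") + c(0:4, 14:20),
              feature = c(rep(NA, 5), rep(1, 5), NA, 2))
dt3 <- tibble(Date = as.Date("1997-07-21") + c(0:7, 16:26),
              feature = c(rep(1, 7), NA, NA, rep(2, 9), NA))

# Compare with the expected columns given in the question
stopifnot(all(fill_feature(dt)$finaloutput  == c(rep(1, 9), rep(2, 5), rep(3, 7))))
stopifnot(all(fill_feature(dt2)$finaloutput == c(rep(1, 5), rep(2, 6), 3)))
stopifnot(all(fill_feature(dt3)$finaloutput == c(rep(1, 8), rep(2, 11))))
```

This only confirms agreement on the three posted examples. Other patterns, such as a two-NA run at the very start or end of the data, are not covered by the question and would need their own expected output.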
