How to do a conditional NA fill in an R data frame

It may be simple, but I could not figure it out. How can I fill the NA values in the feature column of the data frame dt under the conditions below?

The conditions for filling NA are:

  1. If the difference in Date is 1, fill the NA with the previous row's value (easily done with the fill function from tidyverse):

library(tidyverse)

dt_fl <- dt %>%
  fill(feature, .direction = "down")
dt_fl

  2. If the difference in Date is > 1, fill the NA with the previous feature value + 1 and increment the following rows (feature values) by 1 so that the feature values stay continuous. dt_output shows what I expect from dt after filling the NA values and renumbering the features accordingly.

dt<-structure(list(Date = structure(c(15126, 15127, 15128, 15129, 
                    15130, 15131, 15132, 15133, 15134, 15138, 15139, 15140, 15141, 
                    15142, 15143, 15144, 15145, 15146, 15147, 15148, 15149), class = "Date"), 
                    feature = c(1, 1, 1, 1, 1, 1, 1, 1, NA, NA, NA, NA, NA, NA, 
                    2, 2, 2, 2, 2, 2, NA)), row.names = c(NA, -21L), class = c("tbl_df", 
                    "tbl", "data.frame"))
 dt

dt_output<-structure(list(Date = structure(c(15126, 15127, 15128, 15129, 
          15130, 15131, 15132, 15133, 15134, 15138, 15139, 15140, 15141, 
          15142, 15143, 15144, 15145, 15146, 15147, 15148, 15149), class = "Date"), 
          feature = c(1, 1, 1, 1, 1, 1, 1, 1, NA, NA, NA, NA, NA, NA, 
          2, 2, 2, 2, 2, 2, NA), finaloutput = c(1, 1, 1, 1, 1, 1, 
          1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3)), row.names = c(NA, 
           -21L), spec = structure(list(cols = list(Date = structure(list(), class = c("collector_character", 
          "collector")), feature = structure(list(), class = c("collector_double", 
            "collector")), finaloutput = structure(list(), class = c("collector_double", 
          "collector"))), default = structure(list(), class = c("collector_guess", 
          "collector")), skip = 1L), class = "col_spec"), class = c("spec_tbl_df", 
          "tbl_df", "tbl", "data.frame"))
dt_output

Also, following Ben's suggestion: if the data frame starts with NA in feature, as in dt2, how can this be handled? The expected output for dt2 is in dt2_output.

  dt2<-structure(list(Date = structure(c(13675, 13676, 13677, 13678, 
      13679, 13689, 13690, 13691, 13692, 13693, 13694, 13695), class = "Date"), 
    feature = c(NA, NA, NA, NA, NA, 1, 1, 1, 1, 1, NA, 2)), row.names = c(NA, 
    -12L), class = c("tbl_df", "tbl", "data.frame"))
dt2_output<-structure(list(Date = structure(c(13675, 13676, 13677, 13678, 
              13679, 13689, 13690, 13691, 13692, 13693, 13694, 13695), class = "Date"), 
              feature = c(NA, NA, NA, NA, NA, 1, 1, 1, 1, 1, NA, 2), output_feature = c(1, 
              1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3)), row.names = c(NA, -12L
              ), spec = structure(list(cols = list(Date = structure(list(), class = c("collector_character", 
              "collector")), feature = structure(list(), class = c("collector_double", 
              "collector")), output_feature = structure(list(), class = c("collector_double", 
              "collector"))), default = structure(list(), class = c("collector_guess", 
              "collector")), skip = 1L), class = "col_spec"), class = c("spec_tbl_df", 
              "tbl_df", "tbl", "data.frame"))

The solution Ben provides works fine for all conditions except one, shown in dt3 (below), and I am wondering why. My assumption is that the second solution should give dt3_expected for dt3.

dt3<-structure(list(Date = structure(c(10063, 10064, 10065, 10066, 
     10067, 10068, 10069, 10070, 10079, 10080, 10081, 10082, 10083, 
     10084, 10085, 10086, 10087, 10088, 10089), class = "Date"), feature = c(1, 
     1, 1, 1, 1, 1, 1, NA, NA, 2, 2, 2, 2, 2, 2, 2, 2, 2, NA)), row.names = c(NA, 
    -19L), class = c("tbl_df", "tbl", "data.frame"))

dt3

dt3_expected<-structure(list(Date = structure(c(10063, 10064, 10065, 10066, 
10067, 10068, 10069, 10070, 10079, 10080, 10081, 10082, 10083, 
10084, 10085, 10086, 10087, 10088, 10089), class = "Date"), feature = c(1, 
1, 1, 1, 1, 1, 1, NA, NA, 2, 2, 2, 2, 2, 2, 2, 2, 2, NA), output_feature = c(1, 
 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2)), row.names = c(NA, 
-19L), spec = structure(list(cols = list(Date = structure(list(), class = c("collector_character", 
 "collector")), feature = structure(list(), class = c("collector_double", 
"collector")), output_feature = structure(list(), class = c("collector_double", 
  "collector"))), default = structure(list(), class = c("collector_guess", 
  "collector")), skip = 1L), class = "col_spec"), class = c("spec_tbl_df", 
  "tbl_df", "tbl", "data.frame"))

Any help is greatly appreciated, thank you.

You could try creating an "offset" that is incremented whenever a missing value coincides with a difference in dates greater than 1 day. This cumulative offset can then be added to your feature value to determine finaloutput.

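library(dplyr)
library(tidyr)

dt %>%
  # add 1 to the running offset at each NA row that follows a gap of more than 1 day
  mutate(offset = cumsum(is.na(feature) & Date - lag(Date) > 1)) %>%
  fill(feature, .direction = "down") %>%
  mutate(finaloutput = feature + offset)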

Output

# A tibble: 21 x 4
   Date       feature offset finaloutput
   <date>       <dbl>  <int>       <dbl>
 1 2011-06-01       1      0           1
 2 2011-06-02       1      0           1
 3 2011-06-03       1      0           1
 4 2011-06-04       1      0           1
 5 2011-06-05       1      0           1
 6 2011-06-06       1      0           1
 7 2011-06-07       1      0           1
 8 2011-06-08       1      0           1
 9 2011-06-09       1      0           1
10 2011-06-13       1      1           2
11 2011-06-14       1      1           2
12 2011-06-15       1      1           2
13 2011-06-16       1      1           2
14 2011-06-17       1      1           2
15 2011-06-18       2      1           3
16 2011-06-19       2      1           3
17 2011-06-20       2      1           3
18 2011-06-21       2      1           3
19 2011-06-22       2      1           3
20 2011-06-23       2      1           3
21 2011-06-24       2      1           3
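
To see how the offset is built, here is a minimal sketch on a small made-up data frame. The names toy, gap and bump are hypothetical and not part of the original data; the intermediate columns are only there to make each step visible.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: there is a 3-day jump between rows 3 and 4
toy <- tibble(
  Date    = as.Date("2020-01-01") + c(0, 1, 2, 5, 6, 7),
  feature = c(1, 1, NA, NA, 2, NA)
)

toy_filled <- toy %>%
  mutate(
    gap    = as.numeric(Date - lag(Date)),            # days since the previous row; NA in row 1
    bump   = is.na(feature) & !is.na(gap) & gap > 1,  # TRUE only for an NA row that follows a gap
    offset = cumsum(bump)                             # running count of such rows
  ) %>%
  fill(feature, .direction = "down") %>%
  mutate(finaloutput = feature + offset)

toy_filled$finaloutput
# 1 1 1 2 3 3
```

Row 3 is NA but follows a 1-day step, so it is simply filled with 1. Row 4 is NA after a 3-day gap, so the offset becomes 1 and stays 1 for every later row.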

Edit: With the second example dt2, which begins with NA, you can try the following.

First, you can add a default for lag. When the first row is NA, it will still be evaluated for a difference in Date. Since there is no prior Date to compare with, you can use a default more than 1 day earlier than the first Date, so that an offset is added and these initial NA are treated as the "first" feature.

The second issue is filling in the NA when you can't fill in the down direction (there is no prior feature value when the column starts with NA). You can just replace these with 0. Given the offset, this becomes a finaloutput of 0 + 1 = 1.

dt2 %>%
  mutate(offset = cumsum(is.na(feature) & Date - lag(Date, default = first(Date) - 2) > 1)) %>%
  fill(feature, .direction = "down") %>%
  replace_na(list(feature = 0)) %>%
  mutate(finaloutput = feature + offset)

Output

   Date       feature offset finaloutput
   <date>       <dbl>  <int>       <dbl>
 1 2007-06-11       0      1           1
 2 2007-06-12       0      1           1
 3 2007-06-13       0      1           1
 4 2007-06-14       0      1           1
 5 2007-06-15       0      1           1
 6 2007-06-25       1      1           2
 7 2007-06-26       1      1           2
 8 2007-06-27       1      1           2
 9 2007-06-28       1      1           2
10 2007-06-29       1      1           2
11 2007-06-30       1      1           2
12 2007-07-01       2      1           3
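
The effect of the lag default can be checked in isolation. This is a small sketch on a made-up date vector d (a hypothetical name), showing the first difference with and without the default.

```r
library(dplyr)

# Hypothetical vector of three consecutive dates
d <- as.Date("2007-06-11") + 0:2

# Without a default, the first row has nothing to compare against
gap_no_default <- as.numeric(d - lag(d))
gap_no_default
# NA 1 1

# With a default two days before the first date, the first gap is 2,
# so a leading NA in feature satisfies the "> 1" condition
gap_with_default <- as.numeric(d - lag(d, default = first(d) - 2))
gap_with_default
# 2 1 1
```

Any default more than 1 day before the first date would work equally well here; 2 is just the smallest whole number of days that passes the test.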

Edit: Following an additional comment, there is one more criterion to consider.

If the difference in Date is > 1 and there are only 2 NA, the first NA should be filled with the previous feature and the second with the following feature. In other words, the second of 2 NA where there is a gap should be handled differently.

One approach is to count the number of consecutive NA in each run. Then feature can be filled in for this particular circumstance, where the second of two NA is identified with a Date gap.

dt3 %>%
  # label runs of consecutive NA / non-NA rows and count the rows in each run
  mutate(grp = cumsum(c(1, abs(diff(is.na(feature))) == 1))) %>%
  add_count(grp) %>%
  # the second NA of a two-row NA run takes the next feature value
  mutate(feature = ifelse(is.na(feature) & n == 2 & is.na(lag(feature)),
                          lead(feature), feature)) %>%
  mutate(offset = cumsum(is.na(feature) & Date - lag(Date, default = first(Date) - 2) > 1)) %>%
  fill(feature, .direction = "down") %>%
  replace_na(list(feature = 0)) %>%
  mutate(finaloutput = feature + offset)

Output

   Date       feature   grp     n offset finaloutput
   <date>       <dbl> <dbl> <int>  <int>       <dbl>
 1 1997-07-21       1     1     7      0           1
 2 1997-07-22       1     1     7      0           1
 3 1997-07-23       1     1     7      0           1
 4 1997-07-24       1     1     7      0           1
 5 1997-07-25       1     1     7      0           1
 6 1997-07-26       1     1     7      0           1
 7 1997-07-27       1     1     7      0           1
 8 1997-07-28       1     2     2      0           1
 9 1997-08-06       2     2     2      0           2
10 1997-08-07       2     3     9      0           2
11 1997-08-08       2     3     9      0           2
12 1997-08-09       2     3     9      0           2
13 1997-08-10       2     3     9      0           2
14 1997-08-11       2     3     9      0           2
15 1997-08-12       2     3     9      0           2
16 1997-08-13       2     3     9      0           2
17 1997-08-14       2     3     9      0           2
18 1997-08-15       2     3     9      0           2
19 1997-08-16       2     4     1      0           2
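
The run labelling used for grp and n can be illustrated on a plain vector. The sketch below uses a made-up vector x and compares the cumsum/diff expression with base R's rle, which is one alternative way to get the same run ids and run lengths (this is an illustration, not part of the original answer).

```r
# Hypothetical feature vector with two NA runs
x <- c(1, 1, NA, NA, 2, NA)
na_flag <- is.na(x)

# Run ids as computed in the pipeline: a new id starts wherever is.na() flips
grp_cumsum <- cumsum(c(1, abs(diff(na_flag)) == 1))

# The same information from run-length encoding
runs    <- rle(na_flag)
grp_rle <- rep(seq_along(runs$lengths), times = runs$lengths)
n_rle   <- rep(runs$lengths, times = runs$lengths)

grp_cumsum
# 1 1 2 2 3 4
n_rle
# 2 2 2 2 1 1
```

Note that n counts the length of every run, NA or not, which is why the pipeline also tests is.na(feature) before treating a run of length 2 as the special case.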

Note that this could be simplified, but before doing so you will need to be sure it meets your needs.
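
One way to check that before simplifying is to wrap the last pipeline in a function and compare its result with the expected columns from the question. The sketch below is only that: fill_feature is a hypothetical helper name, and the three data frames are rebuilt compactly from the same Date numbers and feature values given in the question.

```r
library(dplyr)
library(tidyr)

# Hypothetical wrapper around the final pipeline; returns only finaloutput
fill_feature <- function(df) {
  df %>%
    mutate(grp = cumsum(c(1, abs(diff(is.na(feature))) == 1))) %>%
    add_count(grp) %>%
    mutate(feature = ifelse(is.na(feature) & n == 2 & is.na(lag(feature)),
                            lead(feature), feature)) %>%
    mutate(offset = cumsum(is.na(feature) &
                             Date - lag(Date, default = first(Date) - 2) > 1)) %>%
    fill(feature, .direction = "down") %>%
    replace_na(list(feature = 0)) %>%
    mutate(finaloutput = feature + offset) %>%
    pull(finaloutput)
}

# Compact rebuild of dt, dt2 and dt3 from the question
mk <- function(days, feature) {
  tibble(Date = as.Date(days, origin = "1970-01-01"), feature = feature)
}
dt  <- mk(c(15126:15134, 15138:15149), c(rep(1, 8), rep(NA, 6), rep(2, 6), NA))
dt2 <- mk(c(13675:13679, 13689:13695), c(rep(NA, 5), rep(1, 5), NA, 2))
dt3 <- mk(c(10063:10070, 10079:10089), c(rep(1, 7), NA, NA, rep(2, 9), NA))

# Expected columns copied from dt_output, dt2_output and dt3_expected
stopifnot(
  isTRUE(all.equal(fill_feature(dt),  c(rep(1, 9), rep(2, 5), rep(3, 7)))),
  isTRUE(all.equal(fill_feature(dt2), c(rep(1, 5), rep(2, 6), 3))),
  isTRUE(all.equal(fill_feature(dt3), c(rep(1, 8), rep(2, 11))))
)
```

If further edge cases turn up (for example, a two-row NA run without a Date gap), adding them as extra data frames to this check would show whether the rule still does what is wanted.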
