How to do a conditional NA fill in an R data frame

Question

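This is probably simple, but I can't figure it out. How can I fill the NA values in the feature column of the data frame dt under the following conditions?

The conditions for filling the NA values are:

If the difference in Date is 1, fill the NA with the value from the previous row (easily done with fill from the tidyverse):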

dt_fl<-dt%>%
  fill(feature, .direction = "down")
dt_fl

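If the difference in Date is > 1, fill the NA with the previous feature value + 1 and increment the feature values in the following rows by 1 so that the numbering stays consecutive. dt_output shows what I expect from dt after filling the NA values and renumbering the features accordingly.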

dt<-structure(list(Date = structure(c(15126, 15127, 15128, 15129, 
                    15130, 15131, 15132, 15133, 15134, 15138, 15139, 15140, 15141, 
                    15142, 15143, 15144, 15145, 15146, 15147, 15148, 15149), class = "Date"), 
                    feature = c(1, 1, 1, 1, 1, 1, 1, 1, NA, NA, NA, NA, NA, NA, 
                    2, 2, 2, 2, 2, 2, NA)), row.names = c(NA, -21L), class = c("tbl_df", 
                    "tbl", "data.frame"))
 dt

dt_output<-structure(list(Date = structure(c(15126, 15127, 15128, 15129, 
          15130, 15131, 15132, 15133, 15134, 15138, 15139, 15140, 15141, 
          15142, 15143, 15144, 15145, 15146, 15147, 15148, 15149), class = "Date"), 
          feature = c(1, 1, 1, 1, 1, 1, 1, 1, NA, NA, NA, NA, NA, NA, 
          2, 2, 2, 2, 2, 2, NA), finaloutput = c(1, 1, 1, 1, 1, 1, 
          1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3)), row.names = c(NA, 
           -21L), spec = structure(list(cols = list(Date = structure(list(), class = c("collector_character", 
          "collector")), feature = structure(list(), class = c("collector_double", 
            "collector")), finaloutput = structure(list(), class = c("collector_double", 
          "collector"))), default = structure(list(), class = c("collector_guess", 
          "collector")), skip = 1L), class = "col_spec"), class = c("spec_tbl_df", 
          "tbl_df", "tbl", "data.frame"))
dt_output

Also, following Ben's suggestion: how can this be handled if the data frame starts with NA in feature, as in dt2? The expected output for dt2 is in dt2_output.

  dt2<-structure(list(Date = structure(c(13675, 13676, 13677, 13678, 
      13679, 13689, 13690, 13691, 13692, 13693, 13694, 13695), class = "Date"), 
    feature = c(NA, NA, NA, NA, NA, 1, 1, 1, 1, 1, NA, 2)), row.names = c(NA, 
    -12L), class = c("tbl_df", "tbl", "data.frame"))

dt2_output<-structure(list(Date = structure(c(13675, 13676, 13677, 13678, 
              13679, 13689, 13690, 13691, 13692, 13693, 13694, 13695), class = "Date"), 
              feature = c(NA, NA, NA, NA, NA, 1, 1, 1, 1, 1, NA, 2), output_feature = c(1, 
              1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3)), row.names = c(NA, -12L
              ), spec = structure(list(cols = list(Date = structure(list(), class = c("collector_character", 
              "collector")), feature = structure(list(), class = c("collector_double", 
              "collector")), output_feature = structure(list(), class = c("collector_double", 
              "collector"))), default = structure(list(), class = c("collector_guess", 
              "collector")), skip = 1L), class = "col_spec"), class = c("spec_tbl_df", 
              "tbl_df", "tbl", "data.frame"))

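The solution Ben provided works for all conditions except one, shown in dt3 below, and I would like to understand why. My assumption was that the second solution should turn dt3 into dt3_expected.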

dt3<-structure(list(Date = structure(c(10063, 10064, 10065, 10066, 
     10067, 10068, 10069, 10070, 10079, 10080, 10081, 10082, 10083, 
     10084, 10085, 10086, 10087, 10088, 10089), class = "Date"), feature = c(1, 
     1, 1, 1, 1, 1, 1, NA, NA, 2, 2, 2, 2, 2, 2, 2, 2, 2, NA)), row.names = c(NA, 
    -19L), class = c("tbl_df", "tbl", "data.frame"))

dt3


dt3_expected<-structure(list(Date = structure(c(10063, 10064, 10065, 10066, 
10067, 10068, 10069, 10070, 10079, 10080, 10081, 10082, 10083, 
10084, 10085, 10086, 10087, 10088, 10089), class = "Date"), feature = c(1, 
1, 1, 1, 1, 1, 1, NA, NA, 2, 2, 2, 2, 2, 2, 2, 2, 2, NA), output_feature = c(1, 
 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2)), row.names = c(NA, 
-19L), spec = structure(list(cols = list(Date = structure(list(), class = c("collector_character", 
 "collector")), feature = structure(list(), class = c("collector_double", 
"collector")), output_feature = structure(list(), class = c("collector_double", 
  "collector"))), default = structure(list(), class = c("collector_guess", 
  "collector")), skip = 1L), class = "col_spec"), class = c("spec_tbl_df", 
  "tbl_df", "tbl", "data.frame"))

Any help is much appreciated, thank you.

Answer 1

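You can try creating an "offset" that is incremented whenever a value is missing and the difference in Date is greater than 1 day. This cumulative offset can then be added to your feature values to determine finaloutput.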

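library(dplyr)
library(tidyr)

dt %>%
  # running count of rows where a missing value follows a gap of more than 1 day
  mutate(offset = cumsum(is.na(feature) & Date - lag(Date) > 1)) %>%
  fill(feature, .direction = "down") %>%
  mutate(finaloutput = feature + offset)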

Output

# A tibble: 21 x 4
   Date       feature offset finaloutput
   <date>       <dbl>  <int>       <dbl>
 1 2011-06-01       1      0           1
 2 2011-06-02       1      0           1
 3 2011-06-03       1      0           1
 4 2011-06-04       1      0           1
 5 2011-06-05       1      0           1
 6 2011-06-06       1      0           1
 7 2011-06-07       1      0           1
 8 2011-06-08       1      0           1
 9 2011-06-09       1      0           1
10 2011-06-13       1      1           2
11 2011-06-14       1      1           2
12 2011-06-15       1      1           2
13 2011-06-16       1      1           2
14 2011-06-17       1      1           2
15 2011-06-18       2      1           3
16 2011-06-19       2      1           3
17 2011-06-20       2      1           3
18 2011-06-21       2      1           3
19 2011-06-22       2      1           3
20 2011-06-23       2      1           3
21 2011-06-24       2      1           3
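
To make the offset mechanism easier to follow, here is a minimal base-R sketch on a small made-up vector (the data and the variable names gap, starts_new and filled are illustrative assumptions, not part of the question). It assumes the first value is not NA.

```r
# Hypothetical toy data: a 3-day gap between the 3rd and 4th dates
dates   <- as.Date("2011-06-01") + c(0, 1, 2, 5, 6, 7)
feature <- c(1, 1, NA, NA, 2, 2)

# Day difference to the previous row; the first row has no predecessor, so use 1
gap <- c(1, as.numeric(diff(dates)))

# TRUE only where the value is missing AND the row follows a gap of more than 1 day
starts_new <- is.na(feature) & gap > 1

# Running count of such rows: every later row is shifted by this amount
offset <- cumsum(starts_new)

# Last observation carried forward, a base-R stand-in for fill(.direction = "down")
# (assumes the first value is not NA)
idx    <- cummax(seq_along(feature) * !is.na(feature))
filled <- feature[idx]

finaloutput <- filled + offset
finaloutput
# 1 1 1 2 3 3
```

The third row is missing but follows a 1-day step, so it keeps the previous value; the fourth row is missing and follows the gap, so it and everything after it is shifted by one.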

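Edit: For the second example, dt2, which starts with NA, you can try the following.

First, you can add a default to lag. When the first row is NA, the difference in Date is still evaluated. Since there is no earlier Date to compare against, you can use a default that is more than 1 day before the first date, so that the offset is incremented and these initial NA values are treated as the "first" feature.

The second issue is filling NA values that fill cannot reach in the down direction (when the data starts with NA, there is no previous feature value). You can replace these with 0. Given the offset, this yields a finaloutput of 0 + 1 = 1.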

dt2 %>%
  mutate(offset = cumsum(is.na(feature) & Date - lag(Date, default = first(Date) - 2) > 1)) %>%
  fill(feature, .direction = "down") %>%
  replace_na(list(feature = 0)) %>%
  mutate(finaloutput = feature + offset)

Output

   Date       feature offset finaloutput
   <date>       <dbl>  <int>       <dbl>
 1 2007-06-11       0      1           1
 2 2007-06-12       0      1           1
 3 2007-06-13       0      1           1
 4 2007-06-14       0      1           1
 5 2007-06-15       0      1           1
 6 2007-06-25       1      1           2
 7 2007-06-26       1      1           2
 8 2007-06-27       1      1           2
 9 2007-06-28       1      1           2
10 2007-06-29       1      1           2
11 2007-06-30       1      1           2
12 2007-07-01       2      1           3
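
To see what the default argument to lag changes, here is a small sketch on three made-up dates (the vector d and the names diff_plain and diff_default are illustrative assumptions). It also shows why a leading NA difference would be a problem: cumsum propagates NA to every later element.

```r
library(dplyr)

# Hypothetical dates: two consecutive days, then a 10-day jump
d <- as.Date("2007-06-11") + c(0, 1, 11)

# Without a default, the first difference is NA because row 1 has no predecessor
diff_plain <- as.numeric(d - lag(d))

# With a default two days before the first date, row 1 looks like it follows a gap
diff_default <- as.numeric(d - lag(d, default = first(d) - 2))

diff_plain    # NA  1 10
diff_default  #  2  1 10

# If row 1 is missing, TRUE & NA is NA, and cumsum() carries that NA forward
cumsum(c(NA, FALSE, TRUE))  # NA NA NA
```

Any default more than 1 day before the first date would behave the same way here; 2 days is simply the smallest whole-day choice.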

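Edit: Following further comments, there is one more criterion to consider.

If the difference in Date is > 1 and there are only 2 NA values, the first NA should be filled with the previous feature and the second with the next feature. In particular, the second of the 2 NA values, the one at the gap, needs to be handled differently.

One approach is to count the number of consecutive NA values. feature can then be filled for this special case, where the second of the two NA values is the one at the Date gap.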

dt3 %>%
  mutate(grp = cumsum(c(1, abs(diff(is.na(feature))) == 1))) %>%
  add_count(grp) %>%
  ungroup() %>%
  mutate(feature = ifelse(is.na(feature) & n == 2 & is.na(lag(feature)), lead(feature), feature)) %>%
  mutate(offset = cumsum(is.na(feature) & Date - lag(Date, default = first(Date) - 2) > 1)) %>%
  fill(feature, .direction = "down") %>%
  replace_na(list(feature = 0)) %>%
  mutate(finaloutput = feature + offset)

Output

   Date       feature   grp     n offset finaloutput
   <date>       <dbl> <dbl> <int>  <int>       <dbl>
 1 1997-07-21       1     1     7      0           1
 2 1997-07-22       1     1     7      0           1
 3 1997-07-23       1     1     7      0           1
 4 1997-07-24       1     1     7      0           1
 5 1997-07-25       1     1     7      0           1
 6 1997-07-26       1     1     7      0           1
 7 1997-07-27       1     1     7      0           1
 8 1997-07-28       1     2     2      0           1
 9 1997-08-06       2     2     2      0           2
10 1997-08-07       2     3     9      0           2
11 1997-08-08       2     3     9      0           2
12 1997-08-09       2     3     9      0           2
13 1997-08-10       2     3     9      0           2
14 1997-08-11       2     3     9      0           2
15 1997-08-12       2     3     9      0           2
16 1997-08-13       2     3     9      0           2
17 1997-08-14       2     3     9      0           2
18 1997-08-15       2     3     9      0           2
19 1997-08-16       2     4     1      0           2

Note that this can be simplified, but before doing so you should make sure it meets your needs.
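
As one possible direction for such a simplification (a sketch only, not the answer's own method), base R's rle can produce the same run identifier and run length that the cumsum/diff and add_count steps compute. The toy vector and the names run_id and run_len are illustrative assumptions.

```r
# Hypothetical toy data with a run of two NAs and a trailing single NA
feature <- c(1, 1, NA, NA, 2, 2, NA)

# Run-length encode the missingness pattern: alternating non-NA / NA runs
runs <- rle(is.na(feature))

# Expand back to one value per row
run_id  <- rep(seq_along(runs$lengths), runs$lengths)  # plays the role of grp
run_len <- rep(runs$lengths, runs$lengths)             # plays the role of n

run_id   # 1 1 2 2 3 3 4
run_len  # 2 2 2 2 2 2 1

# Rows that are the second NA of a run of exactly two NAs
second_of_two <- is.na(feature) & run_len == 2 &
  c(FALSE, is.na(feature)[-length(feature)])
which(second_of_two)  # 4
```

Note that run_len counts non-NA runs as well, so the is.na(feature) condition is still needed when selecting rows.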
