How to do a conditional NA fill in R dataframe

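This is probably simple, but I can't figure it out. How can I fill the NA values in the feature column of the data frame dt using the conditions below?

The conditions for filling NA are:

  1. If the difference in Date is 1, fill the NA with the previous row's value (easily done with the fill function from tidyverse):
library(tidyverse)
dt_fl <- dt %>%
  fill(feature, .direction = "down")
dt_fl
  2. If the difference in Date is > 1, fill the NA with the previous feature value + 1 and increment the following rows (feature values) by 1 so that the feature values stay consecutive. dt_output shows what I expect from dt after filling the NA values and renumbering the features accordingly.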
dt<-structure(list(Date = structure(c(15126, 15127, 15128, 15129, 
                    15130, 15131, 15132, 15133, 15134, 15138, 15139, 15140, 15141, 
                    15142, 15143, 15144, 15145, 15146, 15147, 15148, 15149), class = "Date"), 
                    feature = c(1, 1, 1, 1, 1, 1, 1, 1, NA, NA, NA, NA, NA, NA, 
                    2, 2, 2, 2, 2, 2, NA)), row.names = c(NA, -21L), class = c("tbl_df", 
                    "tbl", "data.frame"))
 dt

dt_output<-structure(list(Date = structure(c(15126, 15127, 15128, 15129, 
          15130, 15131, 15132, 15133, 15134, 15138, 15139, 15140, 15141, 
          15142, 15143, 15144, 15145, 15146, 15147, 15148, 15149), class = "Date"), 
          feature = c(1, 1, 1, 1, 1, 1, 1, 1, NA, NA, NA, NA, NA, NA, 
          2, 2, 2, 2, 2, 2, NA), finaloutput = c(1, 1, 1, 1, 1, 1, 
          1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3)), row.names = c(NA, 
           -21L), spec = structure(list(cols = list(Date = structure(list(), class = c("collector_character", 
          "collector")), feature = structure(list(), class = c("collector_double", 
            "collector")), finaloutput = structure(list(), class = c("collector_double", 
          "collector"))), default = structure(list(), class = c("collector_guess", 
          "collector")), skip = 1L), class = "col_spec"), class = c("spec_tbl_df", 
          "tbl_df", "tbl", "data.frame"))
dt_output

Also, as Ben suggested, how can this be solved if the data frame starts with NA in feature, as in dt2? The expected output for dt2 is in dt2_output.

  dt2<-structure(list(Date = structure(c(13675, 13676, 13677, 13678, 
      13679, 13689, 13690, 13691, 13692, 13693, 13694, 13695), class = "Date"), 
    feature = c(NA, NA, NA, NA, NA, 1, 1, 1, 1, 1, NA, 2)), row.names = c(NA, 
    -12L), class = c("tbl_df", "tbl", "data.frame"))
dt2_output<-structure(list(Date = structure(c(13675, 13676, 13677, 13678, 
              13679, 13689, 13690, 13691, 13692, 13693, 13694, 13695), class = "Date"), 
              feature = c(NA, NA, NA, NA, NA, 1, 1, 1, 1, 1, NA, 2), output_feature = c(1, 
              1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3)), row.names = c(NA, -12L
              ), spec = structure(list(cols = list(Date = structure(list(), class = c("collector_character", 
              "collector")), feature = structure(list(), class = c("collector_double", 
              "collector")), output_feature = structure(list(), class = c("collector_double", 
              "collector"))), default = structure(list(), class = c("collector_guess", 
              "collector")), skip = 1L), class = "col_spec"), class = c("spec_tbl_df", 
              "tbl_df", "tbl", "data.frame"))

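The solution Ben provided works for all conditions except one, shown in dt3 below, and I am wondering why that is. My assumption was that the second solution should produce dt3_expected from dt3.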

dt3<-structure(list(Date = structure(c(10063, 10064, 10065, 10066, 
     10067, 10068, 10069, 10070, 10079, 10080, 10081, 10082, 10083, 
     10084, 10085, 10086, 10087, 10088, 10089), class = "Date"), feature = c(1, 
     1, 1, 1, 1, 1, 1, NA, NA, 2, 2, 2, 2, 2, 2, 2, 2, 2, NA)), row.names = c(NA, 
    -19L), class = c("tbl_df", "tbl", "data.frame"))

dt3

dt3_expected<-structure(list(Date = structure(c(10063, 10064, 10065, 10066, 
10067, 10068, 10069, 10070, 10079, 10080, 10081, 10082, 10083, 
10084, 10085, 10086, 10087, 10088, 10089), class = "Date"), feature = c(1, 
1, 1, 1, 1, 1, 1, NA, NA, 2, 2, 2, 2, 2, 2, 2, 2, 2, NA), output_feature = c(1, 
 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2)), row.names = c(NA, 
-19L), spec = structure(list(cols = list(Date = structure(list(), class = c("collector_character", 
 "collector")), feature = structure(list(), class = c("collector_double", 
"collector")), output_feature = structure(list(), class = c("collector_double", 
  "collector"))), default = structure(list(), class = c("collector_guess", 
  "collector")), skip = 1L), class = "col_spec"), class = c("spec_tbl_df", 
  "tbl_df", "tbl", "data.frame"))

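Any help is much appreciated, thank you.

You can try creating an "offset" that is incremented whenever a value is missing and the difference in Date is greater than 1 day. This cumulative offset can then be added to your feature values to determine finaloutput.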

library(dplyr)
library(tidyr)

dt %>%
  # count rows that are NA and follow a Date gap of more than 1 day
  mutate(offset = cumsum(is.na(feature) & Date - lag(Date) > 1)) %>%
  fill(feature, .direction = "down") %>%
  mutate(finaloutput = feature + offset)

Output

# A tibble: 21 x 4
   Date       feature offset finaloutput
   <date>       <dbl>  <int>       <dbl>
 1 2011-06-01       1      0           1
 2 2011-06-02       1      0           1
 3 2011-06-03       1      0           1
 4 2011-06-04       1      0           1
 5 2011-06-05       1      0           1
 6 2011-06-06       1      0           1
 7 2011-06-07       1      0           1
 8 2011-06-08       1      0           1
 9 2011-06-09       1      0           1
10 2011-06-13       1      1           2
11 2011-06-14       1      1           2
12 2011-06-15       1      1           2
13 2011-06-16       1      1           2
14 2011-06-17       1      1           2
15 2011-06-18       2      1           3
16 2011-06-19       2      1           3
17 2011-06-20       2      1           3
18 2011-06-21       2      1           3
19 2011-06-22       2      1           3
20 2011-06-23       2      1           3
21 2011-06-24       2      1           3
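
To make the offset idea easier to follow, here is a minimal sketch on a small made-up data frame. The name toy, its dates and values, and the intermediate gap_na column are assumptions for illustration only; they are not part of the question's data.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one 3-day jump in Date that coincides with an NA
toy <- tibble(
  Date    = as.Date("2020-01-01") + c(0, 1, 2, 5, 6, 7),
  feature = c(1, 1, NA, NA, 2, NA)
)

toy_filled <- toy %>%
  mutate(
    # TRUE only on rows that are NA and follow a gap of more than 1 day
    gap_na = is.na(feature) & Date - lag(Date) > 1,
    # running count of such rows; this is what gets added to feature
    offset = cumsum(gap_na)
  ) %>%
  fill(feature, .direction = "down") %>%
  mutate(finaloutput = feature + offset)

toy_filled$finaloutput
# 1 1 1 2 3 3
```

Only row 4 is both NA and preceded by a gap, so every row from there on is shifted up by 1.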

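Edit: For the second example dt2, which starts with NA, you can try the following.

First, you can add a default to lag. When the first row is NA, the difference in Date will be evaluated for it. Since there is no previous Date to compare against, you can use a default that is more than 1 day earlier, so that an offset is added and these initial NA are treated as the "first" feature.

The second issue is filling NA when fill cannot work in the down direction (there is no previous feature value when the data starts with NA). You can replace these with 0. Given the offset, this becomes a finaloutput of 0 + 1 = 1.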

dt2 %>%
  mutate(offset = cumsum(is.na(feature) & Date - lag(Date, default = first(Date) - 2) > 1)) %>%
  fill(feature, .direction = "down") %>%
  replace_na(list(feature = 0)) %>%
  mutate(finaloutput = feature + offset)

Output

   Date       feature offset finaloutput
   <date>       <dbl>  <int>       <dbl>
 1 2007-06-11       0      1           1
 2 2007-06-12       0      1           1
 3 2007-06-13       0      1           1
 4 2007-06-14       0      1           1
 5 2007-06-15       0      1           1
 6 2007-06-25       1      1           2
 7 2007-06-26       1      1           2
 8 2007-06-27       1      1           2
 9 2007-06-28       1      1           2
10 2007-06-29       1      1           2
11 2007-06-30       1      1           2
12 2007-07-01       2      1           3
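
The following sketch shows why the default matters, using short made-up vectors (dates and feature here are hypothetical, not dt2 itself). Without a default, lag() returns NA for the first row, TRUE & NA evaluates to NA, and cumsum() carries that NA into every later row.

```r
library(dplyr)

# Hypothetical vectors: the series starts with NA, then a gap of 10 days
dates   <- as.Date("2007-06-11") + c(0, 1, 11)
feature <- c(NA, NA, 1)

# No default: row 1 compares against NA, so the whole cumulative sum is NA
no_default <- cumsum(is.na(feature) & dates - lag(dates) > 1)

# Default two days before the first date: row 1 counts as following a gap
with_default <- cumsum(
  is.na(feature) & dates - lag(dates, default = first(dates) - 2) > 1
)

no_default    # NA NA NA
with_default  # 1 1 1
```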

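Edit: Following an additional comment, there is one more criterion to consider.

If the difference in Date is > 1 and there are only 2 NA, the first NA should be filled with the previous feature and the second with the next feature. In particular, the second of the 2 NA, the one at the gap, should be handled differently.

One approach is to count the number of consecutive NA. The feature can then be filled for this special case, in which the second of the two NA is the one at the Date gap.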

dt3 %>%
  # start a new group whenever the NA status of feature flips
  mutate(grp = cumsum(c(1, abs(diff(is.na(feature))) == 1))) %>%
  # n = length of the run each row belongs to
  add_count(grp) %>%
  # second NA of a two-NA run takes the next feature value
  mutate(feature = ifelse(is.na(feature) & n == 2 & is.na(lag(feature)), lead(feature), feature)) %>%
  mutate(offset = cumsum(is.na(feature) & Date - lag(Date, default = first(Date) - 2) > 1)) %>%
  fill(feature, .direction = "down") %>%
  replace_na(list(feature = 0)) %>%
  mutate(finaloutput = feature + offset)

Output

   Date       feature   grp     n offset finaloutput
   <date>       <dbl> <dbl> <int>  <int>       <dbl>
 1 1997-07-21       1     1     7      0           1
 2 1997-07-22       1     1     7      0           1
 3 1997-07-23       1     1     7      0           1
 4 1997-07-24       1     1     7      0           1
 5 1997-07-25       1     1     7      0           1
 6 1997-07-26       1     1     7      0           1
 7 1997-07-27       1     1     7      0           1
 8 1997-07-28       1     2     2      0           1
 9 1997-08-06       2     2     2      0           2
10 1997-08-07       2     3     9      0           2
11 1997-08-08       2     3     9      0           2
12 1997-08-09       2     3     9      0           2
13 1997-08-10       2     3     9      0           2
14 1997-08-11       2     3     9      0           2
15 1997-08-12       2     3     9      0           2
16 1997-08-13       2     3     9      0           2
17 1997-08-14       2     3     9      0           2
18 1997-08-15       2     3     9      0           2
19 1997-08-16       2     4     1      0           2
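
The grp and n columns can be checked in isolation. Below is a small sketch on a made-up vector (the values are an assumption for illustration) that builds the same run identifier and compares the run lengths with base R's rle().

```r
# Hypothetical feature vector: a run of two NAs and a single trailing NA
feature <- c(1, 1, NA, NA, 2, 2, 2, NA)
na_flag <- is.na(feature)

# A new group starts wherever the NA status flips
grp <- cumsum(c(1, abs(diff(na_flag)) == 1))

# Length of the run each element belongs to (the role n plays after add_count)
n <- ave(grp, grp, FUN = length)

# Base R cross-check: rle() reports one length per run
runs <- rle(na_flag)

grp           # 1 1 2 2 3 3 3 4
n             # 2 2 2 2 3 3 3 1
runs$lengths  # 2 2 3 1
```

Note that a run of two non-NA values also gets n == 2, which is why the pipeline additionally requires is.na(feature).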

Note that this can be simplified, but before doing so you should make sure it meets your needs.
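
As one possible way to package these steps, here is a sketch of a helper function. The name fill_feature is hypothetical and not from the answer; it assumes a data frame with columns named Date and feature, already sorted by Date, and it rebuilds dt3 compactly rather than via structure().

```r
library(dplyr)
library(tidyr)

# Hypothetical helper wrapping the pipeline shown earlier.
# Assumes columns Date and feature, with rows sorted by Date.
fill_feature <- function(df) {
  df %>%
    mutate(grp = cumsum(c(1, abs(diff(is.na(feature))) == 1))) %>%
    add_count(grp) %>%
    mutate(
      # second NA of a two-NA run takes the next feature value
      feature = ifelse(is.na(feature) & n == 2 & is.na(lag(feature)),
                       lead(feature), feature),
      # count remaining NA rows that follow a gap of more than 1 day
      offset = cumsum(is.na(feature) &
                        Date - lag(Date, default = first(Date) - 2) > 1)
    ) %>%
    fill(feature, .direction = "down") %>%
    replace_na(list(feature = 0)) %>%
    mutate(finaloutput = feature + offset) %>%
    select(-grp, -n, -offset)  # drop the helper columns
}

# Same values as dt3 in the question, built compactly
dt3 <- tibble(
  Date    = as.Date(c(10063:10070, 10079:10089), origin = "1970-01-01"),
  feature = c(rep(1, 7), NA, NA, rep(2, 9), NA)
)

res <- fill_feature(dt3)
res$finaloutput
# 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
```

Running the same function on dt and dt2 should reproduce the earlier outputs, since neither contains a run of exactly two NA values.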
