How to do a conditional NA fill in R dataframe
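This is probably simple, but I can't figure it out. How can I fill the NAs in the feature column of the data frame dt using the conditions below?

The conditions for filling an NA are:

If the difference in Date is 1, fill the NA with the previous row's value (easily done with the fill function from the tidyverse):

library(tidyverse)
dt_fl <- dt %>%
  fill(feature, .direction = "down")
dt_fl

If the difference in Date is > 1, fill the NA with the previous feature value + 1, and increase the feature values in the following rows by 1 so that the feature values stay consecutive.

dt_output shows what I expect from dt after filling the NA values and renumbering the features accordingly.

dt<-structure(list(Date = structure(c(15126, 15127, 15128, 15129,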
15130, 15131, 15132, 15133, 15134, 15138, 15139, 15140, 15141,
15142, 15143, 15144, 15145, 15146, 15147, 15148, 15149), class = "Date"),
feature = c(1, 1, 1, 1, 1, 1, 1, 1, NA, NA, NA, NA, NA, NA,
2, 2, 2, 2, 2, 2, NA)), row.names = c(NA, -21L), class = c("tbl_df",
"tbl", "data.frame"))
dt
dt_output<-structure(list(Date = structure(c(15126, 15127, 15128, 15129,
15130, 15131, 15132, 15133, 15134, 15138, 15139, 15140, 15141,
15142, 15143, 15144, 15145, 15146, 15147, 15148, 15149), class = "Date"),
feature = c(1, 1, 1, 1, 1, 1, 1, 1, NA, NA, NA, NA, NA, NA,
2, 2, 2, 2, 2, 2, NA), finaloutput = c(1, 1, 1, 1, 1, 1,
1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3)), row.names = c(NA,
-21L), spec = structure(list(cols = list(Date = structure(list(), class = c("collector_character",
"collector")), feature = structure(list(), class = c("collector_double",
"collector")), finaloutput = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"))
dt_output
Also, following Ben's suggestion: how can this be handled when the data frame starts with NA in feature, as in dt2? The expected output for dt2 is in dt2_output.

dt2<-structure(list(Date = structure(c(13675, 13676, 13677, 13678,
13679, 13689, 13690, 13691, 13692, 13693, 13694, 13695), class = "Date"),
feature = c(NA, NA, NA, NA, NA, 1, 1, 1, 1, 1, NA, 2)), row.names = c(NA,
-12L), class = c("tbl_df", "tbl", "data.frame"))
dt2_output<-structure(list(Date = structure(c(13675, 13676, 13677, 13678,
13679, 13689, 13690, 13691, 13692, 13693, 13694, 13695), class = "Date"),
feature = c(NA, NA, NA, NA, NA, 1, 1, 1, 1, 1, NA, 2), output_feature = c(1,
1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3)), row.names = c(NA, -12L
), spec = structure(list(cols = list(Date = structure(list(), class = c("collector_character",
"collector")), feature = structure(list(), class = c("collector_double",
"collector")), output_feature = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"))
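The solution Ben provided works for every case except one, shown in dt3 below, and I would like to understand why. My assumption was that the second solution should produce dt3_expected from dt3.
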
dt3<-structure(list(Date = structure(c(10063, 10064, 10065, 10066,
10067, 10068, 10069, 10070, 10079, 10080, 10081, 10082, 10083,
10084, 10085, 10086, 10087, 10088, 10089), class = "Date"), feature = c(1,
1, 1, 1, 1, 1, 1, NA, NA, 2, 2, 2, 2, 2, 2, 2, 2, 2, NA)), row.names = c(NA,
-19L), class = c("tbl_df", "tbl", "data.frame"))
dt3
dt3_expected<-structure(list(Date = structure(c(10063, 10064, 10065, 10066,
10067, 10068, 10069, 10070, 10079, 10080, 10081, 10082, 10083,
10084, 10085, 10086, 10087, 10088, 10089), class = "Date"), feature = c(1,
1, 1, 1, 1, 1, 1, NA, NA, 2, 2, 2, 2, 2, 2, 2, 2, 2, NA), output_feature = c(1,
1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2)), row.names = c(NA,
-19L), spec = structure(list(cols = list(Date = structure(list(), class = c("collector_character",
"collector")), feature = structure(list(), class = c("collector_double",
"collector")), output_feature = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"))
Thank you very much for your help.

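You can try creating an "offset" that is incremented whenever you have a missing value and the difference in dates is greater than 1 day. This cumulative offset can then be added to your feature value to determine finaloutput.

library(dplyr)
library(tidyr)
dt %>%
  # offset goes up by 1 at each NA row that follows a Date gap of more than 1 day
  mutate(offset = cumsum(is.na(feature) & Date - lag(Date) > 1)) %>%
  fill(feature, .direction = "down") %>%
  mutate(finaloutput = feature + offset)
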
Output
# A tibble: 21 x 4
   Date       feature offset finaloutput
   <date>       <dbl>  <int>       <dbl>
 1 2011-06-01       1      0           1
 2 2011-06-02       1      0           1
 3 2011-06-03       1      0           1
 4 2011-06-04       1      0           1
 5 2011-06-05       1      0           1
 6 2011-06-06       1      0           1
 7 2011-06-07       1      0           1
 8 2011-06-08       1      0           1
 9 2011-06-09       1      0           1
10 2011-06-13       1      1           2
11 2011-06-14       1      1           2
12 2011-06-15       1      1           2
13 2011-06-16       1      1           2
14 2011-06-17       1      1           2
15 2011-06-18       2      1           3
16 2011-06-19       2      1           3
17 2011-06-20       2      1           3
18 2011-06-21       2      1           3
19 2011-06-22       2      1           3
20 2011-06-23       2      1           3
21 2011-06-24       2      1           3
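
The offset logic can also be traced step by step in base R. The following is only a sketch on made-up toy data (the dates and feature values are hypothetical, not taken from dt). It uses cummax() indexing as a stand-in for fill(), which assumes the first feature value is not NA.

```r
# Hypothetical toy data: six rows with a date gap before row 4
dates   <- as.Date("2020-01-01") + c(0, 1, 2, 6, 7, 8)
feature <- c(1, 1, NA, NA, NA, 2)

# Days since the previous row; the first row has no predecessor, so NA
gap <- c(NA, diff(as.numeric(dates)))

# TRUE only where the value is missing AND the row follows a gap of more than 1 day
bump <- is.na(feature) & gap > 1

# Running count of bumps: the offset that applies to this row and all later rows
offset <- cumsum(bump)

# Carry the last non-NA value forward (assumes feature[1] is not NA)
idx    <- cummax(seq_along(feature) * (!is.na(feature)))
filled <- feature[idx]

finaloutput <- filled + offset
print(data.frame(dates, feature, gap, bump, offset, filled, finaloutput))
```

Row 3 is missing but follows a 1-day step, so it simply inherits the previous value. Row 4 is missing and follows a 4-day gap, so the offset increases there and stays raised for every later row.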
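Edit: With the second example, dt2, which starts with NA, you can try the following.

First, you can add a default to lag. When the first row is NA, the difference in Date is still evaluated. Since there is no previous Date to compare against, you can use a default that is more than 1 day earlier, so that the offset is added and these initial NAs are treated as the "first" feature.

The second issue is filling in NAs when you cannot fill in the down direction (when the data start with NA, there is no previous feature value). You can replace these with 0. Given the offset, this gives a finaloutput of 0 + 1 = 1.

dt2 %>%
  # default = first(Date) - 2 makes the first row count as following a 2-day gap
  mutate(offset = cumsum(is.na(feature) & Date - lag(Date, default = first(Date) - 2) > 1)) %>%
  fill(feature, .direction = "down") %>%
  replace_na(list(feature = 0)) %>%
  mutate(finaloutput = feature + offset)
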
Output
   Date       feature offset finaloutput
   <date>       <dbl>  <int>       <dbl>
 1 2007-06-11       0      1           1
 2 2007-06-12       0      1           1
 3 2007-06-13       0      1           1
 4 2007-06-14       0      1           1
 5 2007-06-15       0      1           1
 6 2007-06-25       1      1           2
 7 2007-06-26       1      1           2
 8 2007-06-27       1      1           2
 9 2007-06-28       1      1           2
10 2007-06-29       1      1           2
11 2007-06-30       1      1           2
12 2007-07-01       2      1           3
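
To see why the default matters when the data start with NA, here is a small base R sketch. The toy vectors are hypothetical, and the date gaps are built by hand with diff() rather than with lag(). It relies on two facts about R: TRUE & NA is NA, and cumsum() carries an NA through every later element.

```r
# Hypothetical toy input that starts with missing values
dates   <- as.Date("2020-01-01") + c(0, 1, 5)
feature <- c(NA, NA, 1)

# No default: the first row has no previous date, so its gap is NA
gap_no_default    <- c(NA, diff(as.numeric(dates)))
offset_no_default <- cumsum(is.na(feature) & gap_no_default > 1)

# Pretend the previous date was 2 days earlier, so the first gap is 2 (> 1)
gap_with_default    <- c(2, diff(as.numeric(dates)))
offset_with_default <- cumsum(is.na(feature) & gap_with_default > 1)

print(offset_no_default)    # every element is NA
print(offset_with_default)  # 1 1 1
```

When the first feature value is not NA, the first comparison is FALSE & NA, which is FALSE, so the missing gap does no harm. That is consistent with the first example working without a default.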
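Edit: Following additional comments, there is one more criterion to consider.

If the difference in Date is > 1 and there are only 2 NAs, the first NA should be filled with the previous feature and the second with the next feature. In particular, the second of the 2 NAs, the one at the gap, needs to be treated differently.

One approach is to count the number of consecutive NAs. feature can then be filled in for this particular case, where the second of two NAs is the one identified with the Date gap.

dt3 %>%
  # label runs of consecutive NA / non-NA rows, then count the rows in each run
  mutate(grp = cumsum(c(1, abs(diff(is.na(feature))) == 1))) %>%
  add_count(grp) %>%
  ungroup() %>%
  # second NA of a two-NA run: take the next feature value instead of the previous one
  mutate(feature = ifelse(is.na(feature) & n == 2 & is.na(lag(feature)), lead(feature), feature)) %>%
  mutate(offset = cumsum(is.na(feature) & Date - lag(Date, default = first(Date) - 2) > 1)) %>%
  fill(feature, .direction = "down") %>%
  replace_na(list(feature = 0)) %>%
  mutate(finaloutput = feature + offset)
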
Output
   Date       feature   grp     n offset finaloutput
   <date>       <dbl> <dbl> <int>  <int>       <dbl>
 1 1997-07-21       1     1     7      0           1
 2 1997-07-22       1     1     7      0           1
 3 1997-07-23       1     1     7      0           1
 4 1997-07-24       1     1     7      0           1
 5 1997-07-25       1     1     7      0           1
 6 1997-07-26       1     1     7      0           1
 7 1997-07-27       1     1     7      0           1
 8 1997-07-28       1     2     2      0           1
 9 1997-08-06       2     2     2      0           2
10 1997-08-07       2     3     9      0           2
11 1997-08-08       2     3     9      0           2
12 1997-08-09       2     3     9      0           2
13 1997-08-10       2     3     9      0           2
14 1997-08-11       2     3     9      0           2
15 1997-08-12       2     3     9      0           2
16 1997-08-13       2     3     9      0           2
17 1997-08-14       2     3     9      0           2
18 1997-08-15       2     3     9      0           2
19 1997-08-16       2     4     1      0           2
Note that this could be simplified, but before doing that you would want to make sure it meets your needs.
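
The run-counting step can be looked at in isolation. This is a hedged base R sketch on a hypothetical toy vector: ave() stands in for add_count(), a hand-built shift stands in for lag(), and rle() is shown as an alternative way to get the same run lengths.

```r
# Hypothetical toy vector: a run of two NAs in the middle, a single NA at the end
feature <- c(1, 1, NA, NA, 2, 2, NA)
missing <- is.na(feature)

# Start a new group id whenever "is missing" flips between TRUE and FALSE
grp <- cumsum(c(1, abs(diff(missing)) == 1))

# Number of rows in the run each row belongs to
n <- as.integer(ave(seq_along(grp), grp, FUN = length))

# rle() reports the same run lengths directly
runs <- rle(missing)

# Second NA of a two-NA run: missing, run length 2, and the previous row is missing too
prev_missing <- c(FALSE, head(missing, -1))
is_second    <- missing & n == 2 & prev_missing

print(data.frame(feature, grp, n, is_second))
```

Here only row 4 is flagged, so it is the only row that would take the next feature value rather than the previous one.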