This may be simple, but I could not figure it out. How can I fill the NA values in the feature column of the data frame dt under the conditions below?
The conditions to fill NA are:
If the date difference from the previous row is 1, fill the NA with the previous row's value (easily done with the fill function from tidyverse):
dt_fl<-dt%>%
fill(feature, .direction = "down")
dt_fl
If the date difference is >1, fill the NA with the previous feature value + 1, and increase the feature values in the following rows by 1 so that the feature values stay continuous.
dt_output shows what I expect from dt after filling the NA values and renumbering the feature values accordingly.
dt<-structure(list(Date = structure(c(15126, 15127, 15128, 15129,
15130, 15131, 15132, 15133, 15134, 15138, 15139, 15140, 15141,
15142, 15143, 15144, 15145, 15146, 15147, 15148, 15149), class = "Date"),
feature = c(1, 1, 1, 1, 1, 1, 1, 1, NA, NA, NA, NA, NA, NA,
2, 2, 2, 2, 2, 2, NA)), row.names = c(NA, -21L), class = c("tbl_df",
"tbl", "data.frame"))
dt
dt_output<-structure(list(Date = structure(c(15126, 15127, 15128, 15129,
15130, 15131, 15132, 15133, 15134, 15138, 15139, 15140, 15141,
15142, 15143, 15144, 15145, 15146, 15147, 15148, 15149), class = "Date"),
feature = c(1, 1, 1, 1, 1, 1, 1, 1, NA, NA, NA, NA, NA, NA,
2, 2, 2, 2, 2, 2, NA), finaloutput = c(1, 1, 1, 1, 1, 1,
1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3)), row.names = c(NA,
-21L), spec = structure(list(cols = list(Date = structure(list(), class = c("collector_character",
"collector")), feature = structure(list(), class = c("collector_double",
"collector")), finaloutput = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"))
dt_output
Also, following Ben's suggestion: if the data frame starts with an NA feature, as in dt2, how can this be handled? The expected output for dt2 is in dt2_output.
dt2<-structure(list(Date = structure(c(13675, 13676, 13677, 13678,
13679, 13689, 13690, 13691, 13692, 13693, 13694, 13695), class = "Date"),
feature = c(NA, NA, NA, NA, NA, 1, 1, 1, 1, 1, NA, 2)), row.names = c(NA,
-12L), class = c("tbl_df", "tbl", "data.frame"))
dt2_output<-structure(list(Date = structure(c(13675, 13676, 13677, 13678,
13679, 13689, 13690, 13691, 13692, 13693, 13694, 13695), class = "Date"),
feature = c(NA, NA, NA, NA, NA, 1, 1, 1, 1, 1, NA, 2), output_feature = c(1,
1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3)), row.names = c(NA, -12L
), spec = structure(list(cols = list(Date = structure(list(), class = c("collector_character",
"collector")), feature = structure(list(), class = c("collector_double",
"collector")), output_feature = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"))
The solution Ben provides works for every case except one, shown in dt3 below, and I am wondering why. My assumption is that the second solution should give dt3_expected for dt3.
dt3<-structure(list(Date = structure(c(10063, 10064, 10065, 10066,
10067, 10068, 10069, 10070, 10079, 10080, 10081, 10082, 10083,
10084, 10085, 10086, 10087, 10088, 10089), class = "Date"), feature = c(1,
1, 1, 1, 1, 1, 1, NA, NA, 2, 2, 2, 2, 2, 2, 2, 2, 2, NA)), row.names = c(NA,
-19L), class = c("tbl_df", "tbl", "data.frame"))
dt3
dt3_expected<-structure(list(Date = structure(c(10063, 10064, 10065, 10066,
10067, 10068, 10069, 10070, 10079, 10080, 10081, 10082, 10083,
10084, 10085, 10086, 10087, 10088, 10089), class = "Date"), feature = c(1,
1, 1, 1, 1, 1, 1, NA, NA, 2, 2, 2, 2, 2, 2, 2, 2, 2, NA), output_feature = c(1,
1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2)), row.names = c(NA,
-19L), spec = structure(list(cols = list(Date = structure(list(), class = c("collector_character",
"collector")), feature = structure(list(), class = c("collector_double",
"collector")), output_feature = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"))
The help is greatly appreciated, thank you.
You could try creating an "offset" that is incremented whenever a row has a missing value and a difference in dates greater than 1 day. This cumulative offset can then be added to your feature value to determine the finaloutput.
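library(dplyr)
library(tidyr)

# offset counts the rows so far that are NA and follow a date gap of more than 1 day
dt %>%
  mutate(offset = cumsum(is.na(feature) & Date - lag(Date) > 1)) %>%
  fill(feature, .direction = "down") %>%
  mutate(finaloutput = feature + offset)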
Output
# A tibble: 21 x 4
Date feature offset finaloutput
<date> <dbl> <int> <dbl>
1 2011-06-01 1 0 1
2 2011-06-02 1 0 1
3 2011-06-03 1 0 1
4 2011-06-04 1 0 1
5 2011-06-05 1 0 1
6 2011-06-06 1 0 1
7 2011-06-07 1 0 1
8 2011-06-08 1 0 1
9 2011-06-09 1 0 1
10 2011-06-13 1 1 2
11 2011-06-14 1 1 2
12 2011-06-15 1 1 2
13 2011-06-16 1 1 2
14 2011-06-17 1 1 2
15 2011-06-18 2 1 3
16 2011-06-19 2 1 3
17 2011-06-20 2 1 3
18 2011-06-21 2 1 3
19 2011-06-22 2 1 3
20 2011-06-23 2 1 3
21 2011-06-24 2 1 3
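To make the offset logic easier to follow, here is a minimal base R sketch that breaks the same idea into separate steps. The toy Date and feature vectors and the helper names (gap, bump, idx, filled) are made up for illustration and are not part of the question's data; the last-observation-carried-forward step is only a stand-in for tidyr's fill.

```r
# Toy data (hypothetical): a 4-day date jump falls inside the run of NA values.
Date <- as.Date("2011-06-07") + c(0, 1, 2, 6, 7, 8)
feature <- c(1, 1, NA, NA, NA, 2)

# Days since the previous row; the first row has no predecessor, so it is NA.
gap <- c(NA, diff(as.numeric(Date)))

# TRUE only where the value is missing AND the date jumped by more than 1 day.
# FALSE & NA is FALSE in R, so the NA gap on row 1 is harmless here because
# feature[1] is not missing.
bump <- is.na(feature) & gap > 1

# Running count of such jumps: every row from a jump onward is shifted by 1.
offset <- cumsum(bump)

# Last observation carried forward (assumes the first value is not NA).
idx <- cummax(ifelse(is.na(feature), 0L, seq_along(feature)))
filled <- feature[idx]

finaloutput <- filled + offset
print(data.frame(Date, feature, gap, bump, offset, finaloutput))
```

Only the fourth row has both a missing feature and a gap above 1, so the offset switches from 0 to 1 there and stays at 1 for the remaining rows.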
Edit: With the second example dt2, which begins with NA, you can try the following.
First, you can add a default for lag. When the first row is NA, the condition still evaluates the difference in Date. Since there is no prior Date to compare with, you can use a default that is more than 1 day earlier, so that an offset is added and these initial NA values are treated as the "first" feature.
The second issue is filling in the NA when you can't fill in the down direction (there is no prior feature value when the column starts with NA). You can just replace these with 0. Given the offset, this gives a finaloutput of 0 + 1 = 1.
dt2 %>%
mutate(offset = cumsum(is.na(feature) & Date - lag(Date, default = first(Date) - 2) > 1)) %>%
fill(feature, .direction = "down") %>%
replace_na(list(feature = 0)) %>%
mutate(finaloutput = feature + offset)
Output
Date feature offset finaloutput
<date> <dbl> <int> <dbl>
1 2007-06-11 0 1 1
2 2007-06-12 0 1 1
3 2007-06-13 0 1 1
4 2007-06-14 0 1 1
5 2007-06-15 0 1 1
6 2007-06-25 1 1 2
7 2007-06-26 1 1 2
8 2007-06-27 1 1 2
9 2007-06-28 1 1 2
10 2007-06-29 1 1 2
11 2007-06-30 1 1 2
12 2007-07-01 2 1 3
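As a rough check of why the default matters, the sketch below rebuilds the Date and feature columns of dt2 from the question and compares the date gap with and without the default. The column names gap_plain and gap_default and the variables res and expected are illustrative additions, not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Same Date and feature values as dt2 in the question.
dt2 <- tibble(
  Date = as.Date(c(13675, 13676, 13677, 13678, 13679, 13689,
                   13690, 13691, 13692, 13693, 13694, 13695),
                 origin = "1970-01-01"),
  feature = c(NA, NA, NA, NA, NA, 1, 1, 1, 1, 1, NA, 2)
)

res <- dt2 %>%
  mutate(
    # Without a default the first gap is NA; with an NA feature in row 1 the
    # condition would be NA, and cumsum() would carry that NA to every row.
    gap_plain = as.numeric(Date - lag(Date)),
    # With the default, row 1 sees an artificial 2-day gap.
    gap_default = as.numeric(Date - lag(Date, default = first(Date) - 2)),
    offset = cumsum(is.na(feature) & gap_default > 1)
  ) %>%
  fill(feature, .direction = "down") %>%
  replace_na(list(feature = 0)) %>%
  mutate(finaloutput = feature + offset)

# output_feature column of dt2_output in the question
expected <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3)
stopifnot(isTRUE(all.equal(res$finaloutput, expected)))
```

The single NA in row 11 follows a 1-day gap, so it does not raise the offset and is simply filled with the previous feature value.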
Edit: Following an additional comment, there is one more criterion to consider.
If the difference in Date is > 1 and there are only 2 NA, the first NA should be filled by the previous feature, and the second by the following feature. In particular, the second of 2 NA where there is a gap should be dealt with differently.
One approach is to count the number of consecutive NA in each run. Then feature can be filled in for this particular circumstance, where the second of two NA is identified with a Date gap.
dt3 %>%
mutate(grp = cumsum(c(1, abs(diff(is.na(feature))) == 1))) %>%
add_count(grp) %>%
ungroup %>%
mutate(feature = ifelse(is.na(feature) & n == 2 & is.na(lag(feature)), lead(feature), feature)) %>%
mutate(offset = cumsum(is.na(feature) & Date - lag(Date, default = first(Date) - 2) > 1)) %>%
fill(feature, .direction = "down") %>%
replace_na(list(feature = 0)) %>%
mutate(finaloutput = feature + offset)
Output
Date feature grp n offset finaloutput
<date> <dbl> <dbl> <int> <int> <dbl>
1 1997-07-21 1 1 7 0 1
2 1997-07-22 1 1 7 0 1
3 1997-07-23 1 1 7 0 1
4 1997-07-24 1 1 7 0 1
5 1997-07-25 1 1 7 0 1
6 1997-07-26 1 1 7 0 1
7 1997-07-27 1 1 7 0 1
8 1997-07-28 1 2 2 0 1
9 1997-08-06 2 2 2 0 2
10 1997-08-07 2 3 9 0 2
11 1997-08-08 2 3 9 0 2
12 1997-08-09 2 3 9 0 2
13 1997-08-10 2 3 9 0 2
14 1997-08-11 2 3 9 0 2
15 1997-08-12 2 3 9 0 2
16 1997-08-13 2 3 9 0 2
17 1997-08-14 2 3 9 0 2
18 1997-08-15 2 3 9 0 2
19 1997-08-16 2 4 1 0 2
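The run-counting step (grp and n above) can also be sketched in base R with rle, which may make it clearer how the second of two NA is located. This is an alternative illustration rather than the code used in the answer; the names runs, prev_missing and second_of_two are made up here, and the sketch assumes that the first row counts as not preceded by a missing value.

```r
# feature values from dt3 in the question
feature <- c(1, 1, 1, 1, 1, 1, 1, NA, NA, 2, 2, 2, 2, 2, 2, 2, 2, 2, NA)

# Run-length encode the missingness pattern: each run is a stretch of
# consecutive NA or consecutive non-NA values.
runs <- rle(is.na(feature))

# Expand back to one entry per row: run id and run length.
grp <- rep(seq_along(runs$lengths), runs$lengths)
n <- rep(runs$lengths, runs$lengths)

# The second NA of a two-NA run: the value is missing, the run has length 2,
# and the previous row is missing as well.
prev_missing <- c(FALSE, head(is.na(feature), -1))
second_of_two <- is.na(feature) & n == 2 & prev_missing

which(second_of_two)
```

For dt3 this flags only row 9, the row that the answer's code fills with the following feature value instead of the previous one.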
Note that this could be simplified, but before doing so I would want to be sure that it meets your needs.