I have a dataframe here:
df <- data.frame("Time" = 1:10, "Value" = c(1.7,NA,-999,-999,1.5,1.6,NA,4,-999,8))
"NA" means there is no observation; those should be left as they are. "-999" means the observation has been identified as an outlier.
Now I am trying to replace the "-999" with the average of the nearest values. For example:
The first "-999" should be replaced with (1.7+1.5)/2 = 1.6
The second "-999" should be replaced with (1.7+1.5)/2 = 1.6
The last "-999" should be replaced with (4.0+8.0)/2 = 6
I tried to use a next statement to move to the next iteration, and an if statement to decide where to stop. But how can I go back to check the previous iterations? Or is there another kind of solution to this?
Many thanks.
Using a few while loops, which bump up how far we lag/lead, we can accomplish this. I am not sure how performant this operation will be on large data sets, but it seems to get the job done for your sample data.
# find where replacements are needed and initialize
where_to_replace <- which(df$Value == -999)
len_replace <- length(where_to_replace)
lag_value <- rep(NA, len_replace)
lead_value <- rep(NA, len_replace)
# more initializing
i <- 1
lag_n <- 1
lead_n <- 1
while(i <= len_replace){
  # find appropriate lagged value
  # can't use NA or lag value == -999
  while(is.na(lag_value[i]) | lag_value[i] == -999){
    lag_value[i] <- dplyr::lag(df$Value, lag_n)[where_to_replace[i]]
    lag_n <- lag_n + 1
  }
  # find appropriate lead value
  # can't use NA or -999 as lead value
  while(is.na(lead_value[i]) | lead_value[i] == -999){
    lead_value[i] <- dplyr::lead(df$Value, lead_n)[where_to_replace[i]]
    lead_n <- lead_n + 1
  }
  # reset iterators
  i <- i + 1
  lag_n <- 1
  lead_n <- 1
}
# replacement value
df$Value[where_to_replace] <- (lead_value + lag_value) / 2
# Time Value
# 1 1 1.7
# 2 2 NA
# 3 3 1.6
# 4 4 1.6
# 5 5 1.5
# 6 6 1.6
# 7 7 NA
# 8 8 4.0
# 9 9 6.0
# 10 10 8.0
One approach utilizing dplyr and purrr could be:
library(dplyr)
library(purrr)
df %>%
  mutate(New_Value = if_else(Value == -999,
                             map_dbl(.x = seq_along(Value),
                                     ~ mean(c(tail(na.omit(na_if(Value[1:(.x - 1)], -999)), 1),
                                              head(na.omit(na_if(Value[(.x + 1):n()], -999)), 1)))),
                             Value))
Time Value New_Value
1 1 1.7 1.7
2 2 NA NA
3 3 -999.0 1.6
4 4 -999.0 1.6
5 5 1.5 1.5
6 6 1.6 1.6
7 7 NA NA
8 8 4.0 4.0
9 9 -999.0 6.0
10 10 8.0 8.0
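One thing worth checking before reusing this on other data: the slices Value[1:(.x - 1)] and Value[(.x + 1):n()] rely on the `:` operator, which counts backwards when its right end is smaller than its left. The sample data never has -999 in the first or last row, so the result above is unaffected, but the small base R sketch below (with a made-up vector `v`) shows what those slices would contain at the edges and how seq_len() avoids it.

```r
v <- c(10, 20, 30)

# At position 1, 1:(1 - 1) is 1:0, i.e. c(1, 0). Index 0 is dropped,
# so the "values before" slice wrongly contains the element itself.
before_colon <- v[1:(1 - 1)]

# seq_len(0) is an empty index, so the slice is empty as intended.
before_safe <- v[seq_len(1 - 1)]

# At the last position, (3 + 1):3 is c(4, 3): an out-of-range NA
# followed by the element itself.
after_colon <- v[(3 + 1):3]
```

If -999 could appear in the first or last row, these slices would probably need guarding, for example with seq_len() for the "before" part.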
I created two new helper columns, before and after. In before, every NA and -999 is filled with the nearest valid value above it; in after, every NA and -999 is filled with the nearest valid value below it. In the next step I overwrote each -999 with the mean of the two values.
library(dplyr)
library(tidyr)
df <- data.frame(Time = 1:10,
                 Value = c(1.7, NA, -999, -999, 1.5,
                           1.6, NA,
                           4, -999, 8))
df <- df %>%
  mutate(before = recode(Value, `-999` = NA_real_),
         after = recode(Value, `-999` = NA_real_)) %>%
  fill(before, .direction = "down") %>%
  fill(after, .direction = "up") %>%
  mutate(Value = case_when(Value == -999 ~ (before + after)/2,
                           TRUE ~ Value)) %>%
  select(Time, Value)
The output:
Time Value
1 1 1.7
2 2 NA
3 3 1.6
4 4 1.6
5 5 1.5
6 6 1.6
7 7 NA
8 8 4.0
9 9 6.0
10 10 8.0
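To see what the two helper columns hold before they are dropped, the sketch below stops the pipeline before the final mutate()/select(). It assumes dplyr and tidyr are installed and uses na_if() in place of recode() purely for brevity; `df_raw` and `helpers` are names chosen here for illustration.

```r
library(dplyr)
library(tidyr)

df_raw <- data.frame(Time = 1:10,
                     Value = c(1.7, NA, -999, -999, 1.5, 1.6, NA, 4, -999, 8))

helpers <- df_raw %>%
  mutate(before = na_if(Value, -999),    # treat outliers as missing
         after  = before) %>%
  fill(before, .direction = "down") %>%  # carry the last valid value forward
  fill(after,  .direction = "up")        # carry the next valid value backward

# Inspect the rows that originally held -999
helpers[c(3, 4, 9), ]
```

For rows 3, 4 and 9, before should hold 1.7, 1.7 and 4, and after should hold 1.5, 1.5 and 8, which average to the replacement values. The genuine NA rows also get filled in the helper columns, but Value itself keeps its NA because only -999 is overwritten.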
Here is a base R option using findInterval
x <- which(df$Value == -999)
y <- setdiff(which(!is.na(df$Value)),x)
ind <- findInterval(x,y)
dfout <- within(df,Value <- replace(Value,x,rowMeans(cbind(Value[y[ind]],Value[y[ind+1]]))))
such that
> dfout
Time Value
1 1 1.7
2 2 NA
3 3 1.6
4 4 1.6
5 5 1.5
6 6 1.6
7 7 NA
8 8 4.0
9 9 6.0
10 10 8.0
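For readers unfamiliar with findInterval(), it may help to look at the intermediate vectors. For each outlier position in x, it returns how many entries of the sorted vector y are less than or equal to it, which is the index in y of the nearest valid position on the left; the next entry of y is then the nearest valid position on the right. The sketch below rebuilds the sample data under the name `d` (chosen here) so that it runs on its own.

```r
d <- data.frame(Time = 1:10,
                Value = c(1.7, NA, -999, -999, 1.5, 1.6, NA, 4, -999, 8))

x <- which(d$Value == -999)              # outlier positions: 3 4 9
y <- setdiff(which(!is.na(d$Value)), x)  # valid positions:   1 5 6 8 10
ind <- findInterval(x, y)                # 1 1 4

left  <- y[ind]        # nearest valid position before each outlier: 1 1 8
right <- y[ind + 1]    # nearest valid position after each outlier:  5 5 10
(d$Value[left] + d$Value[right]) / 2     # 1.6 1.6 6.0
```

This relies on every outlier having a valid value on both sides, as in the sample data; an outlier in the first or last position would need separate handling.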
Just sticking with base R data.frames we can make a function and use sapply over indices of interest.
outliers <- df$Value == -999 # Keep as logical for now
fillers <- which(!is.na(df$Value) & !outliers)
outliers <- which(outliers) # Now convert to indices; FALSE and NA do not appear
filled_outliers <- sapply(outliers, function(x) {
  before_ind <- max(fillers[fillers < x]) # maximum INDEX before an outlier
  after_ind <- min(fillers[fillers > x])  # minimum INDEX after an outlier
  0.5*(df$Value[before_ind] + df$Value[after_ind])
})
df$Value[outliers] <- filled_outliers # assign to the Value column only, so Time is left untouched
df
Gives:
Time Value
1 1 1.7
2 2 NA
3 3 1.6
4 4 1.6
5 5 1.5
6 6 1.6
7 7 NA
8 8 4.0
9 9 6.0
10 10 8.0
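The same idea can be wrapped in a small reusable function. The following is only a sketch: the name `fill_outliers` and its `flag` argument are invented here, and it makes an assumption that the question does not state, namely that an outlier with a valid neighbour on only one side takes that neighbour's value, and becomes NA if it has no valid neighbour at all.

```r
# Hypothetical helper: replace flagged values with the mean of the
# nearest valid (non-NA, non-flag) values on either side.
fill_outliers <- function(values, flag = -999) {
  is_outlier <- !is.na(values) & values == flag
  valid <- which(!is.na(values) & !is_outlier)
  out <- values
  for (i in which(is_outlier)) {
    before <- valid[valid < i]
    after  <- valid[valid > i]
    # Collect whichever neighbours exist; a missing side contributes nothing
    neighbours <- c(if (length(before)) values[max(before)],
                    if (length(after))  values[min(after)])
    # Assumption: use the mean of the available neighbours, NA when there are none
    out[i] <- if (length(neighbours)) mean(neighbours) else NA_real_
  }
  out
}

res <- fill_outliers(c(1.7, NA, -999, -999, 1.5, 1.6, NA, 4, -999, 8))
res
```

On the sample data this should give the same Value column as the other answers, with the genuine NA entries left in place. It could be applied to the data frame with df$Value <- fill_outliers(df$Value).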