
The average of nearest neighbour values in a column in R

I have a dataframe here:

df <- data.frame("Time" = 1:10, "Value" = c(1.7,NA,-999,-999,1.5,1.6,NA,4,-999,8))

"NA" means there is no observation; those entries should be left as they are. "-999" means the observation was identified as an outlier.


Now I am trying to replace the "-999" with the average of the nearest values. For example:

The first "-999" should be replaced with (1.7+1.5)/2 = 1.6
The second "-999" should be replaced with (1.7+1.5)/2 = 1.6
The last "-999" should be replaced with (4.0+8.0)/2 = 6

I tried using a next statement to move to the next iteration and an if statement to decide where to stop. But how can I go back to check the previous iterations? Or is there another kind of solution to this?

Many thanks.

Using a few while loops, which bump up how far we lag/lead, we can accomplish this. I am not sure how performant this operation will be on large data sets. But it seems to get the job done for your sample data.

# find the positions to replace and initialize
where_to_replace <- which(df$Value == -999)
len_replace <- length(where_to_replace)
lag_value <- rep(NA, len_replace)
lead_value <- rep(NA, len_replace)

# more initializing
i <- 1
lag_n <- 1
lead_n <- 1

while(i <= len_replace){
    # find appropriate lagged value
    # can't use NA or lag value == -999
    while(is.na(lag_value[i]) | lag_value[i] == -999){
        lag_value[i] <- dplyr::lag(df$Value, lag_n)[where_to_replace[i]]
        lag_n <- lag_n + 1
    }
    # find appropriate lead value
    # can't use NA or -999 as lead value
    while(is.na(lead_value[i]) | lead_value[i] == -999){
        lead_value[i] <- dplyr::lead(df$Value, lead_n)[where_to_replace[i]]
        lead_n <- lead_n + 1
    }
    # reset iterators   
    i <- i + 1
    lag_n <- 1
    lead_n <- 1
}
# replacement value
df$Value[where_to_replace] <- (lead_value + lag_value) / 2

#    Time Value
# 1     1   1.7
# 2     2    NA
# 3     3   1.6
# 4     4   1.6
# 5     5   1.5
# 6     6   1.6
# 7     7    NA
# 8     8   4.0
# 9     9   6.0
# 10   10   8.0
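
On the performance concern, here is a vectorized base R sketch that avoids the loops. It is not part of the answer above: it swaps in stats::approx() with method = "constant" and f = 0.5, which returns the midpoint of the valid values to the left and right of each requested point. It assumes Time is strictly increasing; the names is_outlier and is_valid are only illustrative. An outlier with no valid neighbour on one side comes back as NA.

```r
df <- data.frame(Time = 1:10,
                 Value = c(1.7, NA, -999, -999, 1.5, 1.6, NA, 4, -999, 8))

# Rows flagged as outliers, and rows that can serve as neighbours
is_outlier <- !is.na(df$Value) & df$Value == -999
is_valid   <- !is.na(df$Value) & !is_outlier

# Step interpolation; f = 0.5 weights the left and right neighbours equally
df$Value[is_outlier] <- approx(
  x      = df$Time[is_valid],
  y      = df$Value[is_valid],
  xout   = df$Time[is_outlier],
  method = "constant",
  f      = 0.5
)$y

df
```

The NA rows are untouched because they are neither used as neighbours nor requested in xout.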

One approach using dplyr and purrr could be:

library(dplyr)  # mutate(), if_else(), na_if(), n()
library(purrr)  # map_dbl()

df %>%
 mutate(New_Value = if_else(Value == -999,
                            map_dbl(.x = seq_along(Value), 
                                    ~ mean(c(tail(na.omit(na_if(Value[1:(.x - 1)], -999)), 1),
                                             head(na.omit(na_if(Value[(.x + 1):n()], -999)), 1)))),
                            Value))

   Time  Value New_Value
1     1    1.7       1.7
2     2     NA        NA
3     3 -999.0       1.6
4     4 -999.0       1.6
5     5    1.5       1.5
6     6    1.6       1.6
7     7     NA        NA
8     8    4.0       4.0
9     9 -999.0       6.0
10   10    8.0       8.0

I created two new helper columns, before and after. In before, every NA and -999 is filled with the nearest valid value above it; in after, every NA and -999 is filled with the nearest valid value below it. In the next step I overwrote each -999 with the mean of the two values.

library(dplyr)  # mutate(), recode(), case_when(), select()
library(tidyr)  # fill()

df <- data.frame(Time = 1:10, 
                 Value = c(1.7, NA, -999, -999, 1.5,
                           1.6, NA,
                           4, -999, 8))


df <- df %>%
  mutate(before = recode(Value, `-999` = NA_real_),
         after = recode(Value, `-999` = NA_real_)) %>%
  fill(before, .direction = "down") %>%
  fill(after, .direction = "up") %>%
  mutate(Value = case_when(Value == -999 ~ (before + after)/2,
                           TRUE ~ Value)) %>%
  select(Time, Value)

Output:

   Time Value
1     1   1.7
2     2    NA
3     3   1.6
4     4   1.6
5     5   1.5
6     6   1.6
7     7    NA
8     8   4.0
9     9   6.0
10   10   8.0
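
Since the helper columns are dropped by select(), it may help to look at them directly. The following base R sketch mimics the two fill() calls without dplyr or tidyr; locf is a hypothetical helper written for this illustration, not a function from either package.

```r
Value <- c(1.7, NA, -999, -999, 1.5, 1.6, NA, 4, -999, 8)

# Hypothetical helper: carry the last non-NA value forward,
# playing the role of fill(.direction = "down")
locf <- function(v) {
  idx <- cummax(seq_along(v) * (!is.na(v)))  # index of the latest non-NA so far, 0 if none
  c(NA, v)[idx + 1]                          # idx 0 maps to NA
}

masked <- replace(Value, Value %in% -999, NA)  # same role as recode(Value, `-999` = NA_real_)
before <- locf(masked)                         # filled downwards
after  <- rev(locf(rev(masked)))               # filled upwards

helpers <- data.frame(Value, before, after)
helpers

# Mean of the two helpers at the flagged positions (3, 4 and 9)
(before + after)[Value %in% -999] / 2
```

Rows 3 and 4 both see before = 1.7 and after = 1.5, and row 9 sees 4 and 8, which is why the replacements are 1.6, 1.6 and 6.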

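Here is a base R option using findInterval:

x <- which(df$Value == -999)              # positions of the outliers
y <- setdiff(which(!is.na(df$Value)), x)  # positions of the valid observations
ind <- findInterval(x, y)                 # for each outlier, index in y of the last valid position before it
dfout <- within(df, Value <- replace(Value, x,
    rowMeans(cbind(Value[y[ind]], Value[y[ind + 1]]))))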

such that

> dfout
   Time Value
1     1   1.7
2     2    NA
3     3   1.6
4     4   1.6
5     5   1.5
6     6   1.6
7     7    NA
8     8   4.0
9     9   6.0
10   10   8.0
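
For readers unfamiliar with findInterval(), printing the intermediate vectors shows how each outlier gets paired with its neighbours. This is just an inspection sketch on the same sample data; left and right are names introduced here for illustration.

```r
df <- data.frame(Time = 1:10,
                 Value = c(1.7, NA, -999, -999, 1.5, 1.6, NA, 4, -999, 8))

x <- which(df$Value == -999)              # outlier positions: 3 4 9
y <- setdiff(which(!is.na(df$Value)), x)  # valid positions: 1 5 6 8 10
ind <- findInterval(x, y)                 # count of valid positions <= each outlier: 1 1 4

left  <- y[ind]      # nearest valid position before each outlier: 1 1 8
right <- y[ind + 1]  # nearest valid position after each outlier: 5 5 10

cbind(outlier = x, left, right,
      left_value = df$Value[left], right_value = df$Value[right])
```

If an outlier had no valid position before it, ind would be 0 and y[ind] would be empty, so the edges would need separate handling.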

Just sticking with base R data.frames we can make a function and use sapply over indices of interest.

outliers <- df$Value == -999 # Keep as logical for now
fillers <- which(!is.na(df$Value) & !outliers)
outliers <- which(outliers) # Now convert to indices; FALSE and NA do not appear
  
filled_outliers <- sapply(outliers, function(x) {
  before_ind <- max(fillers[fillers < x]) # index of the nearest valid value before the outlier
  after_ind <- min(fillers[fillers > x])  # index of the nearest valid value after the outlier
  
  0.5*(df$Value[before_ind] + df$Value[after_ind])
})
  
df$Value[outliers] <- filled_outliers # assign to the Value column only, so Time is left untouched

df

Gives:

   Time Value
1     1   1.7
2     2    NA
3     3   1.6
4     4   1.6
5     5   1.5
6     6   1.6
7     7    NA
8     8   4.0
9     9   6.0
10   10   8.0
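
The sample data always has a valid value on both sides of every -999. As a hedged sketch of how the same idea could be wrapped up to cope with an outlier at the very start or end, here is a small function; fill_outliers and its flag argument are hypothetical names, and returning NA when a neighbour is missing is an assumption about the desired behaviour, not something stated in the question.

```r
# Hypothetical wrapper: replace flagged values with the mean of the nearest
# valid neighbours, or NA when there is no valid value on one side
fill_outliers <- function(v, flag = -999) {
  is_out  <- !is.na(v) & v == flag
  fillers <- which(!is.na(v) & !is_out)  # positions usable as neighbours
  out_idx <- which(is_out)

  v[out_idx] <- vapply(out_idx, function(i) {
    before <- fillers[fillers < i]
    after  <- fillers[fillers > i]
    if (length(before) == 0 || length(after) == 0) return(NA_real_)
    (v[max(before)] + v[min(after)]) / 2
  }, numeric(1))
  v
}

fill_outliers(c(1.7, NA, -999, -999, 1.5, 1.6, NA, 4, -999, 8))
fill_outliers(c(-999, 1, -999, 3))  # leading outlier has no left neighbour
```

Using vapply() with numeric(1) instead of sapply() keeps the result a numeric vector even when there are no outliers at all.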


 