
Fill the gaps, depending on the length of missing values and the last & previous known value in R

Consider the following dataset:

df<-data.frame(ID=c(1,2), Value_1=c(1,7), Value_2= c(NA,10), Value_3=c(NA,13), Value_4=c(7,NA))

What I would like to achieve is this:

df_target<-data.frame(ID=c(1,2), Value_1=c(1,7), Value_2= c(3,10), Value_3=c(5,13), Value_4=c(7,16))

As you can see, there are two different issues:

  1. In the first row, I would like to compute "(last_known + previous_known)/number_of_elements" and keep adding that number to the previous known value until the last known value is reached, i.e. (1+7)/4=2 --> 1; 1+2; 1+2+2; 7
  2. In the second row, the last value is missing, so I would like to use lm() to predict it.

But how do I combine these? The first case is the most challenging part. I guess it should be done with median(last_known, previous_known), then somehow count the missing values, map each one to an na_count_id, and add the product of the median and the corresponding na_count_id to the previous known value:

previous_known_value + na_count_id*median 
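The formula above can be sketched on a single row. This is only an illustration, not a full solution: it assumes the row has exactly one interior gap, and it assumes the step is the difference between the two known values divided by the number of columns separating them, (7-1)/3, which also gives 2 for this row. The names `prev_pos`, `next_pos`, `gap` and `step` are made up for the example.

```r
# Row with ID 1 from the question, as a plain numeric vector
x <- c(1, NA, NA, 7)

# Positions of the known values on either side of the gap
known <- which(!is.na(x))
prev_pos <- known[1]
next_pos <- known[2]

# Assumed step: difference of the known values over the distance between them
step <- (x[next_pos] - x[prev_pos]) / (next_pos - prev_pos)

# na_count_id is 1 for the first missing cell in the gap, 2 for the second, ...
gap <- (prev_pos + 1):(next_pos - 1)
na_count_id <- seq_along(gap)

# previous_known_value + na_count_id * step
x[gap] <- x[prev_pos] + na_count_id * step
x
```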

Thanks in advance for your help!

Here is a solution that works. Based on my testing, it should work even if there is an NA in the first column. Basically, I iterate over every row, column by column. The increaser variable is the amount by which each column must be increased over the previous column to get the pattern you are looking for.
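To make the increaser idea concrete, here is a minimal sketch on plain vectors rather than data frame rows. `row_increaser` is a hypothetical helper written for this illustration; it assumes each row follows a single straight-line trend with at least two distinct known values.

```r
# Hypothetical helper: step per column for one row, ignoring NAs
row_increaser <- function(row_values) {
  value_range <- range(row_values, na.rm = TRUE)
  # Range of the row divided by the distance between the max and min positions
  (value_range[2] - value_range[1]) / (which.max(row_values) - which.min(row_values))
}

inc_up   <- row_increaser(c(7, 10, 13, NA))  # increasing row
inc_down <- row_increaser(c(13, 10, 7, NA))  # decreasing row: the step is negative
```

Because the denominator is the signed distance between the positions of the maximum and the minimum, a decreasing row yields a negative step. A row whose known values are all equal would give 0/0, so that case would need separate handling.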

library(tidyverse)

# Move ID into the row names so that every remaining column is numeric
df <- column_to_rownames(df, var = "ID")

for (i in 1:nrow(df)) {
  # increaser = range of the row divided by the difference between
  # the column indices of the row's max and min
  row_range <- range(df[i, ], na.rm = TRUE)
  increaser <- as.numeric((row_range[2] - row_range[1]) / (which.max(df[i, ]) - which.min(df[i, ])))
  for (j in 1:ncol(df)) { # iterate through every column
    if (is.na(df[i, j])) {
      if (j == 1) {
        # No previous column to build on: take the first non-NA value in the row
        # and step backwards by increaser once per column of distance
        first_known <- min(which(!is.na(df[i, ])))
        df[i, j] <- df[i, first_known] - increaser * (first_known - j)
      } else {
        # NA that is not in the first column: previous (already filled) column plus one step
        df[i, j] <- df[i, j - 1] + increaser
      }
    }
    # Non-NA positions are left unchanged
  }
}

# this loop returns the following. You can convert the row ID back to a column if desired.
# Value_1 Value_2 Value_3 Value_4
#1       1       3       5       7
#2       7      10      13      16
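Since the question mentions lm(), here is an alternative sketch that fits a straight line per row and predicts the missing positions. It is not the method used above, just one possible way to cover both cases with the same code. `fill_row_lm` is a hypothetical helper, and the sketch assumes every row has at least two known values and that a linear trend over the column position is what is wanted.

```r
df <- data.frame(ID = c(1, 2), Value_1 = c(1, 7), Value_2 = c(NA, 10),
                 Value_3 = c(NA, 13), Value_4 = c(7, NA))

# Hypothetical helper: fit value ~ position on the known cells of one row,
# then predict the missing cells from the fitted line
fill_row_lm <- function(values) {
  position <- seq_along(values)
  known <- data.frame(position = position, value = values)[!is.na(values), ]
  fit <- lm(value ~ position, data = known)
  missing <- is.na(values)
  values[missing] <- predict(fit, newdata = data.frame(position = position[missing]))
  values
}

# Apply the helper to every row of the value columns, leaving ID untouched
value_cols <- setdiff(names(df), "ID")
df_filled <- df
df_filled[value_cols] <- t(apply(df[value_cols], 1, fill_row_lm))
df_filled
```

For the two rows in the question the known points lie exactly on a line, so the fitted values should match df_target up to floating-point tolerance. With noisy data the result would be a least-squares fit rather than an exact continuation of the last step.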

