
Conditionally assign values to NAs in a column based on comparison of multiple observations in another column, within a grouped data frame

Below is my example data frame (code and output), which includes the relevant columns from my actual data frame:

example <- data.frame(contig=c("Contig1", "Contig1", "Contig1", "Contig1", "Contig1", "Contig2", "Contig2", "Contig2", "Contig2", "Contig2", "Contig2", "Contig2", "Contig3", "Contig3", "Contig3", "Contig3", "Contig3", "Contig3", "Contig3", "Contig3"),
                  pos=c(500, 650, 750, 1000, 2000, 500, 4100, 5000, 5300, 6100, 6400, 7500, 600, 3800, 4500, 5000, 5500, 6100, 7000, 8000),
                  av=c(NA, 12, NA, NA, NA, NA, NA, 20, NA, NA, 25, NA, NA, 55, NA, NA, NA, 56, NA, NA))

[Image: example data frame]

Currently, only some observations have a value for av, whereas many are NA. I would like to replace the NAs with values of av using two separate methods, so that I can compare their results later, but I don't know how to implement either one.

First, I would like to replace NAs such that, within a contig (i.e. the data frame should be grouped by contig), if the pos of an observation with an NA for av is within 1000 of the pos of an observation with an av value, the NA is replaced by that value of av. Any NA whose pos is not within 1000 of another pos (with an av value) on the same contig should remain NA.
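To make the distance rule concrete, here is a minimal sketch that checks it by hand for Contig2, using the pos and av values from the example data frame. It assumes "within 1000" means a distance strictly less than 1000; the names dist_to_known and has_neighbour are only illustrative.

```r
# Contig2 values copied from the example data frame
pos <- c(500, 4100, 5000, 5300, 6100, 6400, 7500)
av  <- c(NA, NA, 20, NA, NA, 25, NA)

known <- !is.na(av)

# Rows: every observation; columns: the observations that have an av value
dist_to_known <- abs(outer(pos, pos[known], "-"))
rownames(dist_to_known) <- pos
colnames(dist_to_known) <- pos[known]

# TRUE where at least one observation with an av value is closer than 1000
# (assumption: the boundary itself, a distance of exactly 1000, is excluded)
has_neighbour <- apply(dist_to_known < 1000, 1, any)
print(dist_to_known)
print(has_neighbour)
```

Under this reading, the observations at pos 500 and 7500 have no av value close enough and would stay NA, while 4100 and 5300 would take the value from pos 5000, and 6100 the value from pos 6400.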

Second, I would like to replace NAs without requiring the pos to be within 1000 of a pos with an av value, but still within contig groups. Many contig groups have only one observation with an av value, so that value can replace all the NAs within the group (I think na.locf() will do this). However, some contig groups have more than one observation with an av value; for those, I would like each NA to take the av value of the observation whose pos is closest to its own.

Below are the desired outputs of the two methods for the example data frame.

Method 1

[Image: Method 1 output]

Method 2

[Image: Method 2 output]

Pass in the data frame to impute and set the method argument to "method1" or "method2". The function returns the imputed av values as a vector. It will not work if the data frame has a different structure, because I have referenced the columns by index: 1 for contig, 2 for pos and 3 for av.

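impute_av <- function(df, method) {
  # Columns are referenced by index: 1 = contig, 2 = pos, 3 = av.
  # Returns a vector of av values with the NAs filled in where possible.
  sapply(seq_len(nrow(df)), function(i) {

    # Observations that already have a value are returned unchanged
    if (!is.na(df[i, 3])) return(df[i, 3])

    if (method == "method1") {
      # Candidates: same contig and pos strictly less than 1000 away
      y <- df[df[, 1] == df[i, 1] & abs(df[, 2] - df[i, 2]) < 1000, 2:3, drop = FALSE]
    } else if (method == "method2") {
      # Candidates: same contig, no distance limit
      y <- df[df[, 1] == df[i, 1], 2:3, drop = FALSE]
    } else {
      stop("method must be \"method1\" or \"method2\"")
    }

    # Keep only the candidates that have an av value
    y <- y[!is.na(y[, 2]), , drop = FALSE]

    if (nrow(y) == 0) {
      # No candidate available: the value stays NA
      df[i, 3]
    } else {
      # Take av from the candidate whose pos is closest
      y[which.min(abs(y[, 1] - df[i, 2])), 2]
    }
  })
}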
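The same nearest-value rule can also be sketched per group with column names instead of column indices. This is only an illustrative alternative, not the function above: the helpers fill_nearest and fill_by_contig, the max_dist parameter and the toy data frame are all hypothetical, and "within 1000" is again assumed to mean strictly less than 1000.

```r
# Hypothetical helper: for one contig, fill each NA in `av` with the value
# of the nearest observation (by `pos`) that has one, provided that
# observation is closer than `max_dist`.
fill_nearest <- function(pos, av, max_dist = Inf) {
  known <- which(!is.na(av))
  if (length(known) == 0) return(av)        # nothing to borrow from
  for (i in which(is.na(av))) {
    d <- abs(pos[known] - pos[i])
    j <- which.min(d)                       # on ties, the first minimum wins
    if (d[j] < max_dist) av[i] <- av[known[j]]
  }
  av
}

# Hypothetical wrapper: apply the helper within each contig, keeping row order
fill_by_contig <- function(df, max_dist = Inf) {
  out <- df$av
  for (g in unique(df$contig)) {
    rows <- df$contig == g
    out[rows] <- fill_nearest(df$pos[rows], df$av[rows], max_dist)
  }
  out
}

# Small made-up data frame with the same column names as the question
toy <- data.frame(contig = c("A", "A", "A", "B", "B", "B", "B"),
                  pos    = c(100, 600, 2500, 100, 900, 1500, 4000),
                  av     = c(NA, 10, NA, 5, NA, NA, 7))

toy$av_method1 <- fill_by_contig(toy, max_dist = 1000)  # distance-limited
toy$av_method2 <- fill_by_contig(toy)                   # no distance limit
print(toy)
```

Storing both results as new columns next to the original av makes it easy to compare the two methods row by row. Only the originally observed values are used as sources, so a filled-in value is never propagated further.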
