Below is my example data frame (code and output), which includes the relevant columns from my actual data frame:
example <- data.frame(contig=c("Contig1", "Contig1", "Contig1", "Contig1", "Contig1", "Contig2", "Contig2", "Contig2", "Contig2", "Contig2", "Contig2", "Contig2", "Contig3", "Contig3", "Contig3", "Contig3", "Contig3", "Contig3", "Contig3", "Contig3"),
pos=c(500, 650, 750, 1000, 2000, 500, 4100, 5000, 5300, 6100, 6400, 7500, 600, 3800, 4500, 5000, 5500, 6100, 7000, 8000),
av=c(NA, 12, NA, NA, NA, NA, NA, 20, NA, NA, 25, NA, NA, 55, NA, NA, NA, 56, NA, NA))
Currently only some observations have a value for av whereas many are NA. I would like to assign values of av to replace the NAs, and have two different, separate methods that I would like to use to do this, so that I can compare the results of the two methods later, but I don't know how to implement either method.
First, I would like to replace NAs such that, within a contig (i.e. the data frame should be grouped by contig), if the pos of an observation with an NA for av is within 1000 of the pos of an observation with an av value, then the NA will be replaced by that value of av. Any NAs without a pos within 1000 of another pos (with an av value) on the same contig will remain as NA.
Second, I would like to replace NAs without the condition of the pos being within 1000 of a pos with an av value, but still within contig groups. Many contig groups will only have one observation with an av value, so this av value can replace all the NAs within that contig group (I think na.locf() will do this). However, some contig groups have more than one observation with an av value, so for those I would like to assign the NAs the av value of the observation with the pos closer to its own pos value.
Below are the desired outputs of the two methods for the example data frame.
Method 1
Method 2
Just pass in the data frame to impute and set the method argument to "method1" or "method2". The function returns the imputed av values as a vector. If your data frame does not have the same structure it will not work, as I have referenced the columns by their index: 1 for contig, 2 for pos and 3 for av.
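impute_av = function(df, method){
  sapply(1:nrow(df), function(i){
    if(is.na(df[i,3])){
      if(method == "method1"){
        # candidates: same contig, pos strictly within 1000 of this row's pos
        y = df[df[,1] == df[i,1] & df[,2] < df[i,2] + 1000 & df[,2] > df[i,2] - 1000, 2:3, drop=FALSE]
      } else if(method == "method2"){
        # candidates: every row on the same contig
        y = df[df[,1] == df[i,1], 2:3, drop=FALSE]
      }
      # keep only candidates that have an av value
      y = y[!is.na(y[,2]),,drop=FALSE]
      if(nrow(y) == 0){
        # nothing to borrow from, so the NA stays
        df[i,3]
      } else {
        # take av from the candidate whose pos is closest
        y[which.min(abs(y[,1] - df[i,2])), 2]
      }
    } else df[i,3]  # existing av values are returned unchanged
  })
}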
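As a rough sketch of the same nearest-neighbour idea, the version below refers to the columns by name (contig, pos, av) instead of by index, so it does not depend on column order. The helper name impute_nearest and its max_dist argument are made up for this illustration and are not part of the function above; max_dist = 1000 is meant to mirror method 1 (strictly less than 1000 apart) and the default of Inf is meant to mirror method 2.

```r
# Example data from the question
example <- data.frame(
  contig = rep(c("Contig1", "Contig2", "Contig3"), times = c(5, 7, 8)),
  pos = c(500, 650, 750, 1000, 2000, 500, 4100, 5000, 5300, 6100, 6400, 7500,
          600, 3800, 4500, 5000, 5500, 6100, 7000, 8000),
  av = c(NA, 12, NA, NA, NA, NA, NA, 20, NA, NA, 25, NA,
         NA, 55, NA, NA, NA, 56, NA, NA)
)

# Hypothetical helper: fill each NA in av from the nearest known av on the
# same contig, but only if that neighbour is less than max_dist away.
impute_nearest <- function(df, max_dist = Inf) {
  out <- df$av
  for (i in which(is.na(df$av))) {
    # rows on the same contig that already have an av value
    known <- which(df$contig == df$contig[i] & !is.na(df$av))
    if (length(known) == 0) next
    d <- abs(df$pos[known] - df$pos[i])
    j <- which.min(d)  # on ties, the first (lowest pos) neighbour wins
    if (d[j] < max_dist) out[i] <- df$av[known[j]]
  }
  out
}

example$av_method1 <- impute_nearest(example, max_dist = 1000)
example$av_method2 <- impute_nearest(example)
print(example)
```

Storing the two results in separate columns keeps the original av intact, which makes it easy to compare the two methods side by side later.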