Replace NA with value within group for subset

I need to replace missing values in all columns of a data frame, within each ID and time point, for the subgroup of rows that has data from several sources. If it is not too complicated, it would be best to prioritize data from source B (e.g., for id 2 and variable Y in the data below).

The code below currently works (without prioritizing) for one column at a time, but since it's a large data frame with millions of rows, it needs to be automated further. I would also like to keep it within the data.table framework if possible. Any advice?

# Data
id  time  X  Y   Source
1   2005  67 NA  A
1   2005  NA 1.1 B
1   2005  NA 1.1 B
2   2003  85 NA  B
2   2003  NA 0.4 A
2   2003  85 0.5 B

# Desired output
id  time  X  Y   Source
1   2005  67 1.1 A
1   2005  67 1.1 B
1   2005  67 1.1 B
2   2003  85 0.5 B
2   2003  85 0.4 A
2   2003  85 0.5 B

# Find duplicates
dup <- (duplicated(dat[,c('id','time')])|duplicated(dat[,c('id','time')], fromLast=TRUE))

# Replace NA in column X
library(data.table)
dat[dup & is.na(X), X := dat[!is.na(X)][.SD, on=.(id,time), mult="last", X]]

### Solution based on locf and an internal data.table loop (still slower than tidyverse)

    library(data.table)
    library(zoo)

    cols <- colnames(dat)[c(-1, -2)]
    dat <- dat[order(id, time, Source)] # combined with na.locf0(fromLast = TRUE), this takes care of the priority
    dup <- duplicated(dat[, c('id', 'time')]) | duplicated(dat[, c('id', 'time')], fromLast = TRUE)

    t1 <- Sys.time()
    dat <- rbind(
      dat[!dup],
      dat[dup, lapply(.SD, na.locf0, fromLast = TRUE), by = c('id', 'time'), .SDcols = cols][
        , lapply(.SD, na.locf0), by = c('id', 'time'), .SDcols = cols]
    )
    t2 <- Sys.time()
    t2 - t1

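
For numeric columns, the same "fill backward, then forward" idea can probably be written without zoo, using data.table's own nafill (added in data.table 1.12.4). The following is only a sketch under assumptions: the toy table mirrors the question's example, `num_cols` is a name introduced here for illustration, and nafill is meant for numeric columns, so a character column would still need something like na.locf0. Sorting by Source puts the B rows last in each group, so the backward pass picks up B's values first.

```r
library(data.table)

# Toy data mirroring the question's example
dat <- data.table(
  id     = c(1, 1, 1, 2, 2, 2),
  time   = c(2005, 2005, 2005, 2003, 2003, 2003),
  X      = c(67, NA, NA, 85, NA, 85),
  Y      = c(NA, 1.1, 1.1, NA, 0.4, 0.5),
  Source = c("A", "B", "B", "B", "A", "B")
)

num_cols <- c("X", "Y")  # hypothetical name: the numeric columns to fill

# Sort so that Source "B" comes last within each id/time group
setorder(dat, id, time, Source)

# Within each group: next observation carried backward ("nocb"),
# then last observation carried forward ("locf") for what is still NA
dat[, (num_cols) := lapply(.SD, function(v) nafill(nafill(v, "nocb"), "locf")),
    by = .(id, time), .SDcols = num_cols]
dat
```

Groups with a single row pass through unchanged, so the duplicate filter is not needed for correctness here, although skipping those rows may still save time on a large table.
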
library(tidyverse)
library(data.table)

Data <- data.table(id = c(1,1,1,2,2,2), time = c(2005, 2005, 2005, 2003, 2003, 2003), X = c(67, NA, NA, 85, NA, 85),
                       Y = c(NA, 1.1, 1.1, NA, 0.4, 0.5), Source = c("A", "B", "B", "B", "A", "B"))

Data <- Data %>% 
  group_by(id, Source) %>% 
  fill(time, X, Y) %>%
  fill(time, X, Y, .direction = "up")

Data <- Data %>% 
  group_by(id) %>% 
  fill(time, X, Y) %>%
  fill(time, X, Y, .direction = "up")

I am not sure whether you mean that source "B" is always preferred, or that it is preferred only when the sample's own source is also "B" (so that the preferred source would be "A" if that sample's source was "A"). This code solves the issue for the latter scenario. It requires tidyverse.
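
For the other reading, where source "B" is always preferred, one possible data.table sketch is shown here. It is not taken from any of the answers: `pick_value`, `fill_na` and `fill_cols` are hypothetical names introduced for illustration, the toy table mirrors the question's example, and fcoalesce requires data.table 1.12.4 or later.

```r
library(data.table)

Data <- data.table(
  id     = c(1, 1, 1, 2, 2, 2),
  time   = c(2005, 2005, 2005, 2003, 2003, 2003),
  X      = c(67, NA, NA, 85, NA, 85),
  Y      = c(NA, 1.1, 1.1, NA, 0.4, 0.5),
  Source = c("A", "B", "B", "B", "A", "B")
)

# Hypothetical helper: the replacement value for one column within one group.
# Uses the last non-NA value from source "B" if there is one,
# otherwise the last non-NA value from any source (NA if there is none).
pick_value <- function(v, src) {
  from_b <- v[!is.na(v) & src == "B"]
  if (length(from_b) > 0L) return(from_b[length(from_b)])
  any_src <- v[!is.na(v)]
  if (length(any_src) > 0L) any_src[length(any_src)] else v[NA_integer_]
}

# Hypothetical helper: overwrite only the NA cells, keep observed values
fill_na <- function(v, src) fcoalesce(v, pick_value(v, src))

fill_cols <- c("X", "Y")  # hypothetical name: columns whose NAs get replaced

Data[, (fill_cols) := lapply(.SD, fill_na, src = Source),
     by = .(id, time), .SDcols = fill_cols]
Data
```

On the toy data this keeps the observed 0.4 from source A for id 2 and fills the missing Y in that group with 0.5 from source B, which matches the desired output in the question.
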

Here are 3 options:

1) Using a for loop with get:

for (x in updcols) {
    DT0[dup & is.na(get(x)), (x) := DT0[!is.na(get(x))][
        .SD, on=.(id,time), mult="last", get(x)]]   
}
DT0

2) Using a for loop with non-standard evaluation:

nsef <- function(dat, coln) {
    eval(substitute(
        dat[dup & is.na(V), V := dat[!is.na(V)][.SD, on=.(id,time), mult="last", V]],
        list(V=as.name(coln))
    ))
}
for (x in updcols) {
    nsef(DT1, x)
}
DT1

3) Extract the last non-NA values, perform a join, and then update by reference:

lu <- DT2[, lapply(.SD, function(x) last(x[!is.na(x)])), bycols, .SDcols=updcols]
DT2[(dup), (updcols) := 
    lu[.SD, on=bycols, Map(function(x, y) fcoalesce(x, y), 
        mget(paste0("i.", updcols)), mget(updcols))]
]
DT2

You can also use fifelse (version >= 1.12.4) instead of fcoalesce (i.e., fcoalesce(X, Y) == fifelse(is.na(X), Y, X)).
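
A small check of that equivalence, using made-up vectors `x` and `y` that are not part of the question's data:

```r
library(data.table)  # fcoalesce and fifelse need data.table >= 1.12.4

x <- c(NA, 2, NA, 4)
y <- c(10, 20, 30, 40)

a <- fcoalesce(x, y)           # first non-NA value, element by element
b <- fifelse(is.na(x), y, x)   # the same rule written as a conditional

a                # 10  2 30  4
identical(a, b)  # TRUE
```
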

I think timing will depend on the characteristics of your actual dataset.

output:

   id time  X   Y Source
1:  1 2005 67 1.1      A
2:  1 2005 67 1.1      B
3:  1 2005 67 1.1      B
4:  2 2003 85 0.5      B
5:  2 2003 85 0.4      A
6:  2 2003 85 0.5      B

data:

library(data.table) #data.table_1.12.6
DT <- fread("id  time  X  Y   Source
1   2005  67 NA  A
1   2005  NA 1.1 B
1   2005  NA 1.1 B
2   2003  85 NA  B
2   2003  NA 0.4 A
2   2003  85 0.5 B")
DT0 <- copy(DT)
DT1 <- copy(DT)
DT2 <- copy(DT)
bycols <- c('id','time')
updcols <- c("X", "Y")
dup <- duplicated(DT, by=bycols) | duplicated(DT, by=bycols, fromLast=TRUE)
