R data.table: replace missing values by group by value depending on number of missing values in group
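I want to replace the missing values in my data.table group by group, and the fill value should depend on whether all values in the group are missing or only some of them.
I have a working solution, but I am open to better code (in terms of speed, memory, readability, or flexibility).
I am stubborn and would prefer a data.table solution :)
It is a data.table with the following structure: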
dt = data.table(
grouping_1 = sort(rep(c('a', 'b', 'c'), 4)),
grouping_2 = c(1,1,2,2,1,1,2,2,1,1,2,2),
value_1 = c(NA, NA, NA, NA, NA, 1, 2, NA, 3, 2,4,NA),
value_2 = c(NA, 2, NA, NA, 2, 5, 2, 7, 10, 5,NA, NA)
)
It looks like this:
grouping_1 grouping_2 value_1 value_2
1: a 1 NA NA
2: a 1 NA 2
3: a 2 NA NA
4: a 2 NA NA
5: b 1 NA 2
6: b 1 1 5
7: b 2 2 2
8: b 2 NA 7
9: c 1 3 10
10: c 1 2 5
11: c 2 4 NA
12: c 2 NA NA
I want to group it by the columns grouping_1 and grouping_2 and replace the missing values in the columns value_1 and value_2.
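If a given group has no non-missing values (for example, the group grouping_1 == 'a' & grouping_2 == 1 for value_1), I want to replace all NAs in that group with 9000.
If a given group has some non-missing values, I want to replace the missing values with 800 if grouping_2 == 1, and with -800 (negative 800) if grouping_2 == 2. If a value is not missing, I do not want to change it.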
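I wrote the following function, which I then apply to each column whose missing values I want to fill in. The function modifies the original dataset by reference: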
filler_so = function(
data, # the dataset that we will be changing
column, # the column we will be filling in
placeholder_col ='drop_at_the_end', # some temporary column that will disappear in the end
missing_fully = 9000, # value to fill in when all values in group missing
missing_partially_g2_1 = 800, # value to fill when grouping_2 = 1
missing_partially_g2_2 = -800, # value to fill when grouping_2 = 2
g2_col = 'grouping_2', # name of column corresponding to grouping_2 from my example
group_cols = c('grouping_1', 'grouping_2') # names of columns to group by
){
# identify for the given column whether all values in the group are missing,
# or only some are missing. The value will be either Inf (all missing;
# min() also emits a warning in that case) or a real number (none or some missing).
# this info is put in a placeholder column
data[, (placeholder_col) := min(get(column), na.rm = TRUE), by = group_cols]
# if value on a given row is missing, but not all missing in group,
# then fill in the values based on what group is in 2nd grouping column
data[
is.na(get(column)) & (get(placeholder_col) != Inf),
(placeholder_col) := (get(g2_col) == 2) * missing_partially_g2_2 +
(get(g2_col) ==1) * missing_partially_g2_1]
# if all values in group are missing, fill in the "missing_fully" value
data[get(placeholder_col) == Inf, (placeholder_col) := missing_fully]
# put into placeholder column the values that were originally not missing
data[!is.na(get(column)), (placeholder_col) := get(column)]
# drop the original column
data[, (column):=NULL]
# rename the placeholder column to the name of original column
setnames(data, placeholder_col, column)
# if i don't put this here,
# then sometimes the function doesn't return results properly.
# i have no clue why.
data
}
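One possible explanation for why the trailing `data` line seems necessary, offered as a sketch rather than a confirmed diagnosis: `setnames()` modifies the table by reference and returns it invisibly, so a function that ends with `setnames()` prints nothing when called at the top level. The toy table and the two helper functions below are hypothetical and are not part of the code above.

```r
library(data.table)

# Hypothetical toy table, not the data from the question
toy <- data.table(a = 1:3)

# Ends with setnames(): the result is returned invisibly
rename_only <- function(d) setnames(d, "a", "b")

# Ends with a bare symbol: the result is returned visibly
rename_and_return <- function(d) {
  setnames(d, "b", "c")
  d
}

# withVisible() reports whether a value would be auto-printed
vis_without_symbol <- withVisible(rename_only(toy))$visible
vis_with_symbol <- withVisible(rename_and_return(toy))$visible

vis_without_symbol  # FALSE
vis_with_symbol     # TRUE
```

In both cases the table itself has already been changed by reference; only the visibility of the return value differs.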
To apply this function I first need to identify the columns to fill, which I do like this:
cols_to_fill = colnames(dt)[grep('^value', colnames(dt))]
and then I call lapply like this:
lapply(cols_to_fill, function(x) filler_so(dt, x))
> dt
grouping_1 grouping_2 value_1 value_2
1: a 1 9000 800
2: a 1 9000 2
3: a 2 9000 9000
4: a 2 9000 9000
5: b 1 800 2
6: b 1 1 5
7: b 2 2 2
8: b 2 -800 7
9: c 1 3 10
10: c 1 2 5
11: c 2 4 9000
12: c 2 -800 9000
Is there a way to fill in the values that depend on grouping_2 with something like dt[..., (some_column_names) := lapply(.SD, ...), .SDcols = cols_to_fill]?
Try:
replace_NA <- function(v, grouping_2) {
  na_v <- is.na(v)
  if (all(na_v)) {
    # no non-missing value in the group
    return(rep(9000, length(v)))
  }
  # index grouping_2 with na_v so the replacement has the same length as v[na_v]
  v[na_v] <- ifelse(grouping_2[na_v] == 1, 800, -800)
  v
}
dt[, c("v1_new", "v2_new") := .(replace_NA(value_1, grouping_2),
                               replace_NA(value_2, grouping_2)),
   by = .(grouping_1, grouping_2)]
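To make the per-group rule easier to check in isolation, here is a minimal base R sketch that applies it to a single group's vector. The function name `fill_group` and its parameter names are hypothetical; the fill values 9000, 800 and -800 come from the question.

```r
# Hypothetical helper: fill the NAs of one group's values.
# v  : the values of one group
# g2 : the grouping_2 values of that group (constant within the group)
fill_group <- function(v, g2,
                       all_missing = 9000,
                       some_g2_1 = 800,
                       some_g2_2 = -800) {
  na_v <- is.na(v)
  # every value in the group is missing
  if (all(na_v)) return(rep(all_missing, length(v)))
  # only some values are missing: the fill depends on grouping_2
  v[na_v] <- if (g2[1] == 1) some_g2_1 else some_g2_2
  v
}

fill_group(c(NA, NA), c(1, 1))  # 9000 9000
fill_group(c(NA, 1), c(1, 1))   # 800 1
fill_group(c(2, NA), c(2, 2))   # 2 -800
```

Because the fill value is a single scalar, there is no length mismatch when it is assigned to `v[na_v]`.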
This is still verbose, but it uses .SDcols:
library(data.table)
cols <- grep('^value', colnames(dt), value = TRUE)
dt[, (cols) := lapply(.SD, function(x) {
#Check NA values once
tmp <- is.na(x)
#If no non-NA value
if(all(tmp)) return(9000)
#If some missing values
if(any(tmp)) {
#If grouping2 is 1
if(first(grouping_2) == 1)
replace(x, tmp, 800)
else
replace(x, tmp, -800)
}
else x
}), .(grouping_1, grouping_2), .SDcols = cols]
dt
# grouping_1 grouping_2 value_1 value_2
# 1: a 1 9000 800
# 2: a 1 9000 2
# 3: a 2 9000 9000
# 4: a 2 9000 9000
# 5: b 1 800 2
# 6: b 1 1 5
# 7: b 2 2 2
# 8: b 2 -800 7
# 9: c 1 3 10
#10: c 1 2 5
#11: c 2 4 9000
#12: c 2 -800 9000
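As a possible alternative, the same rule can be sketched without a per-group function: first flag the groups in which a column is entirely missing, then fill only the NA rows with a vectorized `fifelse()`. This assumes a data.table version that provides `fifelse()` (1.12.4 or later) and that the value columns are of type double; the temporary column name `all_na` is hypothetical.

```r
library(data.table)

dt <- data.table(
  grouping_1 = sort(rep(c('a', 'b', 'c'), 4)),
  grouping_2 = c(1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 2),
  value_1 = c(NA, NA, NA, NA, NA, 1, 2, NA, 3, 2, 4, NA),
  value_2 = c(NA, 2, NA, NA, 2, 5, 2, 7, 10, 5, NA, NA)
)

cols <- grep('^value', colnames(dt), value = TRUE)

for (col in cols) {
  # TRUE for every row of a group in which this column is entirely NA
  dt[, all_na := all(is.na(.SD[[1L]])),
     by = c('grouping_1', 'grouping_2'), .SDcols = col]
  # touch only the NA rows; non-missing values are left unchanged
  dt[is.na(get(col)),
     (col) := fifelse(all_na, 9000, fifelse(grouping_2 == 1, 800, -800))]
}

# drop the temporary helper column
dt[, all_na := NULL]
dt[]
```

The loop makes one grouped pass per column to compute the flag and one ungrouped pass to fill, which may or may not be faster than the `lapply(.SD, ...)` version depending on the number of groups.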