dplyr group_by and sample while ignoring NA

Question

I would like to gap-fill NA values in each group by sampling non-NA values from the same group.

This is the closest I have come to what I want, using !is.na() as suggested in "Ignoring values or NAs in the sample function".


> dput(data)
structure(list(len = c(NA, 45447.4157838775, 161037.71538108, 
78147.8550470324, 7193.48815617057, 1571.95459212405, 18191.381972185, 
20366.2132412031, 10014.987524596, 1403.72511829297, 5651.17842991513, 
6848.03271105711, 8043.32937011393, 8926.65133418451, 5808.44456603825, 
2208.14264175252, 1797.4936747033, 5325.76651327694, 2660.66730207955, 
5844.07912541444, 3956.40473896271, 959.873314407621, 3294.01472360025, 
5221.94864001864, 3781.51913857335, 7811.83819953768, 3387.20323328623, 
5514.92099458441, 5792.54371531706, 5643.98385143961, 15478.916809379, 
8401.66533205217, 7046.25074819247, 2734.73639821402, NA, 62332.3343404513, 
NA, 46563.1214718113, 25590.4020105238, 13015.3682275862, 4984.80432801441, 
NA), point = c(NA, 0, 8, 5, 2, 0, 9, 0, 0, 0, 3, 1, 0, 6, 1, 
1, 0, 0, 1, 0, 0, 0, 1, 2, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, NA, 
10, NA, 19, 6, 5, 0, NA), country = structure(c(1L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 
3L, 2L, 2L, 2L, 2L, 1L), .Label = c("WCY_____ES", "WCY_____FR", 
"WCY_____IT"), class = "factor"), group = c(1L, 2L, 2L, 2L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
4L, 4L, 4L, 4L, 4L)), row.names = c(NA, -42L), class = "data.frame")

library(dplyr)

data1 <- data %>% 
  group_by(group) %>%
  mutate(nulen = if_else(country == 'WCY_____FR', len, sample(len[!is.na(len)], 1, TRUE)),
         nupoint = if_else(country == 'WCY_____FR', point, sample(point[!is.na(point)], 1, TRUE)))

But instead I get this error:

Error in sample.int(length(x), size, replace, prob) : invalid first argument
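A likely cause, shown here on a made-up one-element vector rather than the full data: group 1 contains a single row whose len is NA, so len[!is.na(len)] is empty for that group, and sample() cannot draw one value from an empty vector. The names len_group1, pool and err are illustrative only and do not appear in the original post.

```r
# Stand-in for the `len` values of group 1 (one row, value NA).
len_group1 <- c(NA_real_)

# The candidate pool after dropping NA is empty.
pool <- len_group1[!is.na(len_group1)]
length(pool)  # 0

# Drawing one value from an empty pool fails; capture the message
# instead of stopping the script.
err <- tryCatch(
  sample(pool, 1, TRUE),
  error = function(e) conditionMessage(e)
)
print(err)
```

Because if_else() evaluates both of its value arguments for the whole group, the failing sample() call runs even when no row in that group would use its result.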

There should be no significant difference between the known and gap-filled distributions. If there are no values to sample from within the same group (either all other values are NA or the group has only one row), then the sample should be taken from the entire dataset. Any package is fine.
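One way to meet these requirements, as a minimal sketch and not a definitive solution: sample within the group when it has non-NA values, otherwise fall back to the whole column. The helper fill_na_by_sampling, the data frame toy and the vector all_len are hypothetical names introduced for illustration; toy is a small stand-in for data. The sketch also assumes every NA should be filled regardless of country, which differs from the country condition in the code above.

```r
library(dplyr)

# Hypothetical helper: replace NA entries of `x` with values sampled
# (with replacement) from the non-NA entries of `x`. If `x` has none,
# sample from the non-NA entries of `fallback` instead.
fill_na_by_sampling <- function(x, fallback) {
  pool <- x[!is.na(x)]
  if (length(pool) == 0) pool <- fallback[!is.na(fallback)]
  n_missing <- sum(is.na(x))
  # Nothing to fill, or nothing to sample from anywhere: return unchanged.
  if (n_missing == 0 || length(pool) == 0) return(x)
  # Index into the pool rather than calling sample(pool, ...): when pool is
  # a single number >= 1, sample() would draw from 1:pool instead.
  x[is.na(x)] <- pool[sample.int(length(pool), n_missing, replace = TRUE)]
  x
}

# Toy data standing in for the question's `data`.
toy <- data.frame(
  len   = c(NA, 10, 20, NA, 30, NA),
  group = c(1L, 2L, 2L, 2L, 3L, 3L)
)

# Whole-column values, captured before grouping, used as the fallback pool.
all_len <- toy$len

set.seed(1)
filled <- toy %>%
  group_by(group) %>%
  mutate(nulen = fill_na_by_sampling(len, all_len)) %>%
  ungroup()

print(filled)
```

Group 1 has no known values, so its NA is drawn from the whole column; groups 2 and 3 are filled from their own values. Each NA gets its own draw, which keeps the filled values closer to the known distribution than repeating a single sampled value.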

Answer 1

Here is an idea,

data %>%
    mutate(len1 = replace(len, is.na(len), sample(len[!is.na(len)], 1, TRUE)),
           point1 = replace(point, is.na(point), sample(point[!is.na(point)], 1, TRUE))) %>%
    group_by(group) %>% 
    mutate(nulen = ifelse(all(is.na(len)) & country == 'WCY_____FR', len1, 
                          ifelse(is.na(len) & country == 'WCY_____FR', sample(len[!is.na(len)], 1, TRUE), len)))

which gives,

       len point country    group    len1 point1   nulen
     <dbl> <dbl> <fct>      <int>   <dbl>  <dbl>   <dbl>
 1      NA    NA WCY_____ES     1   1572.      0      NA
 2  45447.     0 WCY_____FR     2  45447.      0  45447.
 3 161038.     8 WCY_____FR     2 161038.      8 161038.
 4  78148.     5 WCY_____FR     2  78148.      5  78148.
 5   7193.     2 WCY_____FR     3   7193.      2   7193.
 6   1572.     0 WCY_____FR     3   1572.      0   1572.
 7  18191.     9 WCY_____FR     3  18191.      9  18191.
 8  20366.     0 WCY_____FR     3  20366.      0  20366.
 9  10015.     0 WCY_____FR     3  10015.      0  10015.
10   1404.     0 WCY_____FR     3   1404.      0   1404.
# ... with 32 more rows

The same can be done for point as well.

Answer 1 (0, 2019-08-20 11:38:24)
