Replacing NA in a dplyr chain

Question

Question has been edited from the original.

After reading this interesting discussion I was wondering how to replace NAs in a column using dplyr in, for example, the Lahman batting data:

Source: local data frame [96,600 x 3]
Groups: teamID

   yearID teamID G_batting
1    2004    SFN        11
2    2006    CHN        43
3    2007    CHA         2
4    2008    BOS         5
5    2009    SEA         3
6    2010    SEA         4
7    2012    NYA        NA

The following does not work as I expected:

library(dplyr)
library(Lahman)

df <- Batting[ c("yearID", "teamID", "G_batting") ]
df <- group_by(df, teamID )
df$G_batting[is.na(df$G_batting)] <- mean(df$G_batting, na.rm = TRUE)

Source: local data frame [20 x 3]
Groups: yearID, teamID

   yearID teamID G_batting
1    2004    SFN  11.00000
2    2006    CHN  43.00000
3    2007    CHA   2.00000
4    2008    BOS   5.00000
5    2009    SEA   3.00000
6    2010    SEA   4.00000
7    2012    NYA  **49.07894**

> mean(Batting$G_batting, na.rm = TRUE)
[1] **49.07894**

In fact it imputed the overall mean and not the group mean. How would you do this in a dplyr chain? Using transform from base R also does not work, as it imputes the overall mean and not the group mean. This approach also converts the data to a regular data frame. Is there a better way to do this?

df %.% 
  group_by( yearID ) %.%
  transform(G_batting = ifelse(is.na(G_batting), 
    mean(G_batting, na.rm = TRUE), 
    G_batting)
  )
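For readers following along, here is a minimal base R sketch of the difference between the two kinds of imputation. The data frame toy and its columns team and g are made up for illustration and are not part of the Lahman data. Extracting a column with $ returns the whole column, so the mean is taken over all rows unless the grouping is applied explicitly (here with ave).

```r
# Hypothetical toy data, not taken from Lahman
toy <- data.frame(team = c("A", "A", "B", "B"),
                  g    = c(10, NA, 30, 50))

# Mean over the whole column: (10 + 30 + 50) / 3
overall_mean <- mean(toy$g, na.rm = TRUE)

# Mean computed within each team, returned with one value per row
group_mean <- ave(toy$g, toy$team, FUN = function(x) mean(x, na.rm = TRUE))

# Imputing with the overall mean versus the per-team mean
imputed_overall <- ifelse(is.na(toy$g), overall_mean, toy$g)
imputed_group   <- ifelse(is.na(toy$g), group_mean, toy$g)
```

In this toy case the missing value for team A becomes 30 with the overall mean and 10 with the team mean, which is the same kind of discrepancy as in the output above.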

Edit: Replacing transform with mutate gives the following error:

Error in mutate_impl(.data, named_dots(...), environment()) : 
  INTEGER() can only be applied to a 'integer', not a 'double'

Edit: Adding as.integer seems to resolve the error and does produce the expected result. See also @eddi's answer.

df %.% 
  group_by( teamID ) %.%
  mutate(G_batting = ifelse(is.na(G_batting), as.integer(mean(G_batting, na.rm = TRUE)), G_batting))

Source: local data frame [96,600 x 3]
Groups: teamID

   yearID teamID G_batting
1    2004    SFN        11
2    2006    CHN        43
3    2007    CHA         2
4    2008    BOS         5
5    2009    SEA         3
6    2010    SEA         4
7    2012    NYA        47

> mean_NYA <- mean(filter(df, teamID == "NYA")$G_batting, na.rm = TRUE)
> as.integer(mean_NYA)
[1] 47

Edit: Following up on @Romain's comment I installed dplyr from github:

> head(df,10)
   yearID teamID G_batting
1    2004    SFN        11
2    2006    CHN        43
3    2007    CHA         2
4    2008    BOS         5
5    2009    SEA         3
6    2010    SEA         4
7    2012    NYA        NA
8    1954    ML1       122
9    1955    ML1       153
10   1956    ML1       153

> df %.% 
+   group_by(teamID)  %.%
+   mutate(G_batting = ifelse(is.na(G_batting), mean(G_batting, na.rm = TRUE), G_batting))
Source: local data frame [96,600 x 3]
Groups: teamID

   yearID teamID  G_batting
1    2004    SFN          0
2    2006    CHN          0
3    2007    CHA          0
4    2008    BOS          0
5    2009    SEA          0
6    2010    SEA 1074266112
7    2012    NYA   90693125
8    1954    ML1        122
9    1955    ML1        153
10   1956    ML1        153
..    ...    ...        ...

So I didn't get the error (good) but I got a (seemingly) strange result.

Answer 1

The main issue you're having is that mean returns a double while the G_batting column is an integer. So wrapping the mean in as.integer would work, or you'd need to convert the entire column to numeric, I guess.
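A quick way to see the type mismatch, and the two fixes just mentioned, is a base R sketch like the one below. The vector g is made-up toy data standing in for an integer column such as G_batting.

```r
# Hypothetical integer vector with one missing value
g <- c(11L, 43L, NA, 5L)

type_col  <- typeof(g)                      # storage type of the column
type_mean <- typeof(mean(g, na.rm = TRUE))  # storage type of its mean

# Fix 1: truncate the mean so the result stays an integer vector
fix_int <- ifelse(is.na(g), as.integer(mean(g, na.rm = TRUE)), g)

# Fix 2: convert the column to double first and keep the fractional mean
g_num   <- as.numeric(g)
fix_num <- ifelse(is.na(g_num), mean(g_num, na.rm = TRUE), g_num)
```

Note that as.integer truncates rather than rounds, so the first fix loses the fractional part of the mean. Converting the column to numeric avoids that at the cost of changing the column type.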

That said, here are a couple of data.table alternatives - I didn't check which one is faster.

library(data.table)

# using ifelse
dt = data.table(a = 1:2, b = c(1,2,NA,NA,3,4,5,6,7,8))
dt[, b := ifelse(is.na(b), mean(b, na.rm = T), b), by = a]

# using a temporary column
dt = data.table(a = 1:2, b = c(1,2,NA,NA,3,4,5,6,7,8))
dt[, b.mean := mean(b, na.rm = T), by = a][is.na(b), b := b.mean][, b.mean := NULL]

And this is what I'd ideally want to do (there is an FR about this):

# again, atm this is pure fantasy and will not work
dt[, b[is.na(b)] := mean(b, na.rm = T), by = a]

The dplyr version of the ifelse is (as in OP):

dt %>% group_by(a) %>% mutate(b = ifelse(is.na(b), mean(b, na.rm = T), b))

I'm not sure how to implement the second data.table idea in a single line in dplyr. I'm also not sure how you can stop dplyr from scrambling/ordering the data (aside from creating an index column).
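For what it's worth, one possible way to write the temporary-column idea in a dplyr chain is sketched below. This is only a sketch under assumptions: it assumes a dplyr version that provides the %>% pipe, and the column names row_id and b_mean are made up for illustration. The index column is the workaround mentioned above for restoring the original row order, and it may be unnecessary if your dplyr version already keeps the rows in place.

```r
library(dplyr)

# Same toy data as the data.table examples, as a plain data frame
df <- data.frame(a = rep(1:2, 5), b = c(1, 2, NA, NA, 3, 4, 5, 6, 7, 8))

res <- df %>%
  mutate(row_id = row_number()) %>%           # index column to restore the order later
  group_by(a) %>%
  mutate(b_mean = mean(b, na.rm = TRUE),      # temporary per-group mean
         b = ifelse(is.na(b), b_mean, b)) %>% # fill NAs from the temporary column
  ungroup() %>%
  arrange(row_id) %>%                         # back to the original row order
  select(-row_id, -b_mean)                    # drop the helper columns
```

With this toy data the group means are 4 for a == 1 and 5 for a == 2, so the two missing values in rows 3 and 4 are filled with 4 and 5 respectively.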

Answer 1: 32 · accepted · 2014-02-12 00:25:07