
dplyr group_by + mutate: unexpected NA values appear

I have a data.frame like this:

datdf  <- structure(list(BM = rep("1907-01-01", 20), 
                         ct = structure(rep(c(1L, 2L), each = 5, times = 2), 
                                        .Label = c("B", "A"), class = "factor"), 
                         val = c(rep(NA, 10), 9901:9910), 
                         facet = rep(c(1, 2), each = 10) ), 
                    row.names = c(NA, -20L), 
                    .Names = c("BM", "ct", "val", "facet"), 
                    class = c("tbl_df", "tbl", "data.frame"))

My problem is the following. After a groupwise mutate (I need cumsum), I get NA values in one of the groups. And it is not only cumsum: any modification of val produces NA.

datdf %>% group_by(BM, facet, ct) %>% mutate(v1 = val + 100, v2 = cumsum(val), v3 = val)

#            BM     ct   val facet    v1    v2    v3
#         (chr) (fctr) (int) (dbl) (dbl) (int) (int)
# 11 1907-01-01      B  9901     2 10001  9901  9901
# 12 1907-01-01      B  9902     2 10002 19803  9902
# 13 1907-01-01      B  9903     2 10003 29706  9903
# 14 1907-01-01      B  9904     2 10004 39610  9904
# 15 1907-01-01      B  9905     2 10005 49515  9905
# 16 1907-01-01      A  9906     2    NA    NA  9906
# 17 1907-01-01      A  9907     2    NA    NA  9907
# 18 1907-01-01      A  9908     2    NA    NA  9908
# 19 1907-01-01      A  9909     2    NA    NA  9909
# 20 1907-01-01      A  9910     2    NA    NA  9910

My dplyr version is 0.4.3, R is 3.1.3

Is it a bug, or am I missing something? As far as I remember, I did not have this issue with dplyr 0.4.1, before I updated it a few weeks ago.

How can I fix it now?

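A workaround is to use the mapvalues function from plyr to replace NAs with zeros. Note that the snippets below do not call group_by, so cumsum runs over the whole column rather than within each group.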

Just for the v2 (cumsum column):

library(plyr)
datdf %>% mutate(v1 = val + 100,
                 v2 = cumsum(val %>% mapvalues(NA, 0)),
                 v3 = val)

Output:

           BM     ct   val facet    v1    v2    v3
        (chr) (fctr) (int) (dbl) (dbl) (dbl) (int)
1  1907-01-01      B    NA     1    NA     0    NA
2  1907-01-01      B    NA     1    NA     0    NA
3  1907-01-01      B    NA     1    NA     0    NA
4  1907-01-01      B    NA     1    NA     0    NA
5  1907-01-01      B    NA     1    NA     0    NA
6  1907-01-01      A    NA     1    NA     0    NA
7  1907-01-01      A    NA     1    NA     0    NA
8  1907-01-01      A    NA     1    NA     0    NA
9  1907-01-01      A    NA     1    NA     0    NA
10 1907-01-01      A    NA     1    NA     0    NA
11 1907-01-01      B  9901     2 10001  9901  9901
12 1907-01-01      B  9902     2 10002 19803  9902
13 1907-01-01      B  9903     2 10003 29706  9903
14 1907-01-01      B  9904     2 10004 39610  9904
15 1907-01-01      B  9905     2 10005 49515  9905
16 1907-01-01      A  9906     2 10006 59421  9906
17 1907-01-01      A  9907     2 10007 69328  9907
18 1907-01-01      A  9908     2 10008 79236  9908
19 1907-01-01      A  9909     2 10009 89145  9909
20 1907-01-01      A  9910     2 10010 99055  9910

For all columns:

datdf %>% mutate(v1 = val %>% mapvalues(NA, 0) + 100,
                 v2 = cumsum(val %>% mapvalues(NA, 0)),
                 v3 = val %>% mapvalues(NA, 0))

Output:

           BM     ct   val facet    v1    v2    v3
        (chr) (fctr) (int) (dbl) (dbl) (dbl) (dbl)
1  1907-01-01      B    NA     1   100     0     0
2  1907-01-01      B    NA     1   100     0     0
3  1907-01-01      B    NA     1   100     0     0
4  1907-01-01      B    NA     1   100     0     0
5  1907-01-01      B    NA     1   100     0     0
6  1907-01-01      A    NA     1   100     0     0
7  1907-01-01      A    NA     1   100     0     0
8  1907-01-01      A    NA     1   100     0     0
9  1907-01-01      A    NA     1   100     0     0
10 1907-01-01      A    NA     1   100     0     0
11 1907-01-01      B  9901     2 10001  9901  9901
12 1907-01-01      B  9902     2 10002 19803  9902
13 1907-01-01      B  9903     2 10003 29706  9903
14 1907-01-01      B  9904     2 10004 39610  9904
15 1907-01-01      B  9905     2 10005 49515  9905
16 1907-01-01      A  9906     2 10006 59421  9906
17 1907-01-01      A  9907     2 10007 69328  9907
18 1907-01-01      A  9908     2 10008 79236  9908
19 1907-01-01      A  9909     2 10009 89145  9909
20 1907-01-01      A  9910     2 10010 99055  9910
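
Because the snippets above do not group the data, the cumulative sum keeps accumulating from group B into group A (59421 in row 16). If the groupwise result asked for in the question is what is needed, one possible sketch in base R is to replace NA with 0 and then apply cumsum within each BM/facet/ct group using ave(). This is not part of the original answer; the data is rebuilt here as a plain data.frame so that the snippet runs on its own, and val0 is just an illustrative name.

```r
# Rebuild the example data from the question as a plain data.frame.
datdf <- data.frame(
  BM    = rep("1907-01-01", 20),
  ct    = factor(rep(c("B", "A"), each = 5, times = 2), levels = c("B", "A")),
  val   = c(rep(NA, 10), 9901:9910),
  facet = rep(c(1, 2), each = 10),
  stringsAsFactors = FALSE
)

# Replace NA with 0 (val0 is an illustrative helper variable).
val0 <- ifelse(is.na(datdf$val), 0L, datdf$val)

# Cumulative sum within each BM/facet/ct group; ave() keeps the row order.
datdf$v2 <- ave(val0, datdf$BM, datdf$facet, datdf$ct, FUN = cumsum)

datdf[16:20, c("ct", "val", "facet", "v2")]
# v2 for rows 16 to 20: 9906 19813 29721 39630 49540
```

With this approach the sum restarts at each group, so row 16 holds 9906 rather than 59421.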

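Maybe you ran into one of these issues: https://github.com/hadley/dplyr/issues/1448#issuecomment-150037548

Try this. Note that plyr::mutate ignores the grouping set by group_by, so the cumulative sum runs across groups, as the output shows:

datdf %>% group_by(BM, facet, ct) %>% plyr::mutate(v1 = val + 100, v2 = cumsum(val[!is.na(val)]), v3 = val)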

               BM     ct   val facet    v1    v2    v3
            (chr) (fctr) (int) (dbl) (dbl) (int) (int)
    11 1907-01-01      B  9901     2 10001  9901  9901
    12 1907-01-01      B  9902     2 10002 19803  9902
    13 1907-01-01      B  9903     2 10003 29706  9903
    14 1907-01-01      B  9904     2 10004 39610  9904
    15 1907-01-01      B  9905     2 10005 49515  9905
    16 1907-01-01      A  9906     2 10006 59421  9906
    17 1907-01-01      A  9907     2 10007 69328  9907
    18 1907-01-01      A  9908     2 10008 79236  9908
    19 1907-01-01      A  9909     2 10009 89145  9909
    20 1907-01-01      A  9910     2 10010 99055  9910
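
One caveat with val[!is.na(val)] is that it is shorter than val whenever NAs are present, so the result no longer lines up element by element with the original column. A length-preserving variant, sketched here in base R with a hypothetical helper named cumsum_skip_na (not a function from dplyr or plyr), accumulates over the non-missing entries only and leaves the NAs in place:

```r
# Hypothetical helper: cumulative sum that skips NA but keeps the vector length.
cumsum_skip_na <- function(x) {
  out <- x
  ok <- !is.na(x)            # positions holding real values
  out[ok] <- cumsum(x[ok])   # accumulate over those positions only
  out                        # NA positions stay NA
}

cumsum_skip_na(c(NA, 1L, NA, 2L, 3L))
# NA  1 NA  3  6
```

Because the output has the same length as the input, it can be assigned as a new column without relying on recycling.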
