Sum NA cases in dplyr's summarise

I can't figure out what I am doing wrong when summarising a column that contains both values and NAs. I have read in many places that you can count cases in summarise() with sum(), and that NA cases can be counted with sum(is.na(variable)).

Indeed, I can reproduce that behaviour with a test tibble:

library(dplyr)

df <- tibble(x = c(rep("a", 5), rep("b", 5)), y = c(NA, NA, 1, 1, NA, 1, 1, 1, NA, NA))

df %>%
  group_by(x) %>%
  summarise(one = sum(y, na.rm = TRUE),
            na = sum(is.na(y)))

And this is the expected result:

# A tibble: 2 x 3
      x   one    na
  <chr> <dbl> <int>
1     a     2     3
2     b     3     2

For some reason, I cannot reproduce the result with my own data:

mydata <- structure(list(Group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Amphibians", 
"Birds", "Mammals", "Reptiles", "Plants"), class = "factor"), 
    Scenario = structure(c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 
    1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("Present", 
    "RCP 4.5", "RCP 8.5"), class = "factor"), year = c(1940, 
    1940, 1940, 1940, 1940, 1940, 1940, 1940, 1940, 1940, 1940, 
    1940, 1940, 1940, 1940, 1940, 1940, 1940), random = c("obs", 
    "obs", "obs", "obs", "obs", "obs", "obs", "obs", "obs", "obs", 
    "obs", "obs", "obs", "obs", "obs", "obs", "obs", "obs"), 
    species = c("Allobates fratisenescus", "Allobates fratisenescus", 
    "Allobates fratisenescus", "Allobates juanii", "Allobates juanii", 
    "Allobates juanii", "Allobates kingsburyi", "Allobates kingsburyi", 
    "Allobates kingsburyi", "Adelophryne adiastola", "Adelophryne adiastola", 
    "Adelophryne adiastola", "Adelophryne gutturosa", "Adelophryne gutturosa", 
    "Adelophryne gutturosa", "Adelphobates quinquevittatus", 
    "Adelphobates quinquevittatus", "Adelphobates quinquevittatus"
    ), Endemic = c(1, 1, 1, 1, 1, 1, 1, 1, 1, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA)), row.names = c(NA, -18L), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"), vars = "species", indices = list(
    9:11, 12:14, 15:17, 0:2, 3:5, 6:8), group_sizes = c(3L, 3L, 
3L, 3L, 3L, 3L), biggest_group_size = 3L, labels = structure(list(
    species = c("Adelophryne adiastola", "Adelophryne gutturosa", 
    "Adelphobates quinquevittatus", "Allobates fratisenescus", 
    "Allobates juanii", "Allobates kingsburyi")), row.names = c(NA, 
-6L), class = "data.frame", vars = "species", .Names = "species"), .Names = c("Group", 
"Scenario", "year", "random", "species", "Endemic"))

(My data has several million rows; only a small part of it is reproduced here.)

Testsum <- mydata %>% 
  group_by(Group, Scenario, year, random) %>% 
  summarise(All = n(),
            Endemic = sum(Endemic, na.rm = T),
            noEndemic = sum(is.na(Endemic)))

# A tibble: 3 x 7
# Groups:   Group, Scenario, year [?]
       Group Scenario  year random   All Endemic noEndemic
      <fctr>   <fctr> <dbl>  <chr> <int>   <dbl>     <int>
1 Amphibians  Present  1940    obs     6       3         0
2 Amphibians  RCP 4.5  1940    obs     6       3         0
3 Amphibians  RCP 8.5  1940    obs     6       3         0

I expected noEndemic to be 3 in every row, since Endemic is NA for 3 of the species...

I double-checked the class of the column:

Test3$Endemic %>% class
[1] "numeric"

Obviously, there is something very stupid that I am not seeing, even after several hours of messing around. Is it obvious to any of you? Thanks!

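The reason for this behavior is that Endemic was assigned as a new summarised variable. summarise() evaluates its expressions in order, and each later expression sees the columns created by the earlier ones. So once Endemic = sum(Endemic, na.rm = T) has run, Endemic is a single non-NA number per group, and sum(is.na(Endemic)) tests that number instead of the original column, which gives 0. Instead, give the sum a new column name and rename it afterwards: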

mydata %>%
     group_by(Group, Scenario, year, random) %>%
     summarise(All = n(),
               EndemicS = sum(Endemic, na.rm = TRUE),
               noEndemic = sum(is.na(Endemic))) %>%
     rename(Endemic = EndemicS) 
# A tibble: 3 x 7
# Groups:   Group, Scenario, year [3]
#       Group Scenario  year random   All Endemic noEndemic
#      <fctr>   <fctr> <dbl>  <chr> <int>   <dbl>     <int>
#1 Amphibians  Present  1940    obs     6       3         3
#2 Amphibians  RCP 4.5  1940    obs     6       3         3
#3 Amphibians  RCP 8.5  1940    obs     6       3         3
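
As a minimal sketch of the same point, assuming only that dplyr is attached: because the expressions are evaluated in order, counting the NAs before the column is overwritten should also work, without the rename step. The tibble `toy` and its columns `g` and `v` are hypothetical illustration data, not part of the question.

```r
library(dplyr)

# Hypothetical toy data: two groups, each containing some NA values
toy <- tibble(g = c("a", "a", "a", "b", "b"),
              v = c(1, NA, NA, 1, NA))

# Pitfall: v is replaced by its sum first, so is.na() only sees
# one non-NA number per group and the count is always 0
bad <- toy %>%
  group_by(g) %>%
  summarise(v    = sum(v, na.rm = TRUE),
            n_na = sum(is.na(v)))

# Alternative: count the NAs first, while v is still the original column,
# and only then overwrite v with its sum
good <- toy %>%
  group_by(g) %>%
  summarise(n_na = sum(is.na(v)),
            v    = sum(v, na.rm = TRUE))
```

Here `bad$n_na` is 0 for both groups, while `good$n_na` is 2 for group a and 1 for group b. Using a distinct name for the sum, as in the answer above, is less fragile, since it does not depend on the order of the arguments.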
