
Sum NA cases in dplyr's summarise

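I can't work out what I'm doing wrong when summarising a column that contains both values and NAs. Everything I've read says you can use sum() inside summarise() to total the cases, and sum(is.na(variable)) to count the NA cases.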

Indeed, I can reproduce that behaviour with a small test:

library(dplyr)

df <- tibble(x = c(rep("a",5), rep("b",5)), y = c(NA, NA, 1, 1, NA, 1, 1, 1, NA, NA))

df %>%
  group_by(x) %>% 
  summarise(one = sum(y, na.rm = TRUE),
            na = sum(is.na(y)))

This gives the expected result:

# A tibble: 2 x 3
      x   one    na
  <chr> <dbl> <int>
1     a     2     3
2     b     3     2

For some reason, I can't reproduce this result with my own data:

mydata <- structure(list(Group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Amphibians", 
"Birds", "Mammals", "Reptiles", "Plants"), class = "factor"), 
    Scenario = structure(c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 
    1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("Present", 
    "RCP 4.5", "RCP 8.5"), class = "factor"), year = c(1940, 
    1940, 1940, 1940, 1940, 1940, 1940, 1940, 1940, 1940, 1940, 
    1940, 1940, 1940, 1940, 1940, 1940, 1940), random = c("obs", 
    "obs", "obs", "obs", "obs", "obs", "obs", "obs", "obs", "obs", 
    "obs", "obs", "obs", "obs", "obs", "obs", "obs", "obs"), 
    species = c("Allobates fratisenescus", "Allobates fratisenescus", 
    "Allobates fratisenescus", "Allobates juanii", "Allobates juanii", 
    "Allobates juanii", "Allobates kingsburyi", "Allobates kingsburyi", 
    "Allobates kingsburyi", "Adelophryne adiastola", "Adelophryne adiastola", 
    "Adelophryne adiastola", "Adelophryne gutturosa", "Adelophryne gutturosa", 
    "Adelophryne gutturosa", "Adelphobates quinquevittatus", 
    "Adelphobates quinquevittatus", "Adelphobates quinquevittatus"
    ), Endemic = c(1, 1, 1, 1, 1, 1, 1, 1, 1, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA)), row.names = c(NA, -18L), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"), vars = "species", indices = list(
    9:11, 12:14, 15:17, 0:2, 3:5, 6:8), group_sizes = c(3L, 3L, 
3L, 3L, 3L, 3L), biggest_group_size = 3L, labels = structure(list(
    species = c("Adelophryne adiastola", "Adelophryne gutturosa", 
    "Adelphobates quinquevittatus", "Allobates fratisenescus", 
    "Allobates juanii", "Allobates kingsburyi")), row.names = c(NA, 
-6L), class = "data.frame", vars = "species", .Names = "species"), .Names = c("Group", 
"Scenario", "year", "random", "species", "Endemic"))

(My data has several million rows; I've copied only a small part of it here.)

Testsum <- mydata %>% 
  group_by(Group, Scenario, year, random) %>% 
  summarise(All = n(),
            Endemic = sum(Endemic, na.rm = T),
            noEndemic = sum(is.na(Endemic)))

# A tibble: 3 x 7
# Groups:   Group, Scenario, year [?]
       Group Scenario  year random   All Endemic noEndemic
      <fctr>   <fctr> <dbl>  <chr> <int>   <dbl>     <int>
1 Amphibians  Present  1940    obs     6       3         0
2 Amphibians  RCP 4.5  1940    obs     6       3         0
3 Amphibians  RCP 8.5  1940    obs     6       3         0

!!!! I expected noEndemic to be 3 in every row, because Endemic is NA for 3 of the species.

I double-checked the column type:

Test3$Endemic %>% class
[1] "numeric"

Clearly, after several hours of going round in circles, I'm missing something very silly. Is it obvious to any of you? Thanks!!!

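This happens because the summary is assigned to Endemic, the same name as the input column. summarise() evaluates its arguments in order, and a newly created summary immediately replaces an existing variable of the same name, so the later is.na(Endemic) sees the per-group sum (which is never NA) instead of the original column. Give the sum a new column name instead, and rename it afterwards if needed: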

mydata %>%
     group_by(Group, Scenario, year, random) %>%
     summarise(All = n(),
               EndemicS = sum(Endemic, na.rm = TRUE),
               noEndemic = sum(is.na(Endemic))) %>%
     rename(Endemic = EndemicS) 
# A tibble: 3 x 7
# Groups:   Group, Scenario, year [3]
#       Group Scenario  year random   All Endemic noEndemic
#      <fctr>   <fctr> <dbl>  <chr> <int>   <dbl>     <int>
#1 Amphibians  Present  1940    obs     6       3         3
#2 Amphibians  RCP 4.5  1940    obs     6       3         3
#3 Amphibians  RCP 8.5  1940    obs     6       3         3
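
As a further illustration, the sketch below reproduces the effect on the small test data from the question and shows one possible alternative to renaming: computing the NA count before the column name is reused. The object names shadowed and reordered are made up for this example, and it assumes dplyr is installed.

```r
library(dplyr)

# Toy data from the question: two groups of five values, some of them NA.
df <- tibble(x = c(rep("a", 5), rep("b", 5)),
             y = c(NA, NA, 1, 1, NA, 1, 1, 1, NA, NA))

# Reusing the name y for the sum hides the original column, so the
# following is.na(y) inspects the one-value summary and counts no NAs.
shadowed <- df %>%
  group_by(x) %>%
  summarise(y = sum(y, na.rm = TRUE),
            na = sum(is.na(y)))

# Counting the NAs first means the original column is still visible
# when is.na(y) is evaluated (hypothetical alternative to renaming).
reordered <- df %>%
  group_by(x) %>%
  summarise(na = sum(is.na(y)),
            y = sum(y, na.rm = TRUE))

shadowed$na   # 0 0
reordered$na  # 3 2
```

Reordering keeps the final column name without a rename() step, but it depends on argument order, so the renaming approach above may be easier to maintain.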

