简体   繁体   中英

Conditional Sum of a column in R with NAs

I have a dataset with 4 columns which looks like that:

City Year Week Average
Guelph 2020 2020-04-12 28.3
Hamilton 2020 2020-04-12 10.7
Waterloo 2020 2020-04-12 50.1
Guelph 2020 2020-04-20 3.5
Hamilton 2020 2020-04-20 42.9

I would like to sum the average column for the same week. In other words, I want to create a new dataset with three columns (Year, week, Average) where I won't have 3 different rows for the same week but only one (eg instead of having three times 20220-04-12, I will have it one) and the corresponding cell in the average column will be the sum of all the rows that correspond to the same week. Something like that:

Year Week Average
2020 2020-04-12 89.1
2020 2020-04-20 46.4

where 89.1 is the sum of the first three rows that are at the same week and 46.4 is the sum of the last two rows of the initial table that correspond to the same week (2020-04-20).

The code I am using for that looks like that:

data_set <- data_set %>%
    select(`Year`, `Week`, `Average`) %>%
    group_by(Year, Week) %>% 
    summarize(Average = sum(Average))
   

but for some weeks I am getting back NAs and for some other I get the correct sum I want. The data are all numeric and in the initial dataset there are some NA values on the Average column.

Thanks in advance

You can accomplish this by passing in na.rm = TRUE to sum . Also, since you group_by(Year, Week) , there isn't much to gain with using select in this case since you are generating a summary statistic on the Average variable within summarise .

df <- structure(list(City = c("Guelph", "Hamilton", "Waterloo", "Guelph", 
"Hamilton"), Year = c(2020L, 2020L, 2020L, 2020L, 2020L), Week = c("2020-04-12", 
"2020-04-12", "2020-04-12", "2020-04-20", "2020-04-20"), Average = c(28.3, 
10.7, 50.1, 3.5, 42.9)), class = "data.frame", row.names = c(NA, 
-5L))

library(dplyr)

df %>% 
  mutate(
    Week = as.Date(Week),
  ) %>% 
  group_by(Year, Week) %>% 
  summarise(
    Average = sum(Average, na.rm = TRUE)
  )

#> # A tibble: 2 x 3
#> # Groups:   Year [1]
#>    Year Week       Average
#>   <int> <date>       <dbl>
#> 1  2020 2020-04-12    89.1
#> 2  2020 2020-04-20    46.4

Created on 2021-03-10 by the reprex package (v0.3.0)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM