简体   繁体   中英

How to use summarize and group by to create a summary table in r?

So I have the following data set (this is a small sample/example of what it looks like, with the original being 7k rows and 30 columns over 7 decades):

Year,Location,Population Total, Median Age, Household Total
2000, Adak, 220, 45, 67
2000, Akiachak, 567, NA, 98
2000, Rainfall, 2, NA, 11
1990, Adak, NA, 33, 56
1990, Akiachak, 456, NA, 446
1990, Tioga, 446, NA, NA

I want to create a summary table that indicates how many years of data is available by location for each variable. So something like this would work (for the small example from before):

Location,Population Total, Median Age, Household Total
Adak,1,2,2
Akiachak,2,0,2
Rainfall,1,0,1
Tioga,1,0,0

I'm new to R and haven't used these two commands together so I'm unsure of the syntax. Any help would be wonderful or alternatives.

You can do something like this:

x %>%
  group_by(Location) %>%
  summarise(count_years = n(), 
            count_pop_total = sum(!is.na(Population_Total)),
            count_median_age = sum(!is.na(Median_Age)),
            count_house_total = sum(!is.na(Household_Total)))

where you can replace the mean with whatever operation you want to perform. You should take a look at the dplyr vignette for more general solutions.

A solution with summarize_all from dplyr :

library(dplyr)
df %>%
  group_by(Location) %>%
  summarize_all(funs(sum(!is.na(.)))) %>%
  select(-Year)

Or you can use summarize_at :

df %>%
  group_by(Location) %>%
  summarize_at(vars(-Year), funs(sum(!is.na(.))))

Result:

# A tibble: 4 x 4
   Location PopulationTotal MedianAge HouseholdTotal
      <chr>           <int>     <int>          <int>
1      Adak               1         2              2
2  Akiachak               2         0              2
3  Rainfall               1         0              1
4     Tioga               1         0              0

Data:

df = read.table(text = "Year,Location,PopulationTotal, MedianAge, HouseholdTotal
                2000, Adak, 220, 45, 67
                2000, Akiachak, 567, NA, 98
                2000, Rainfall, 2, NA, 11
                1990, Adak, NA, 33, 56
                1990, Akiachak, 456, NA, 446
                1990, Tioga, 446, NA, NA", header = TRUE, sep = ",", stringsAsFactors = FALSE)

library(dplyr)    
df = df %>%
  mutate_at(vars(PopulationTotal:HouseholdTotal), as.numeric)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM