简体   繁体   中英

Aggregating columns matching two criteria

I've used the aggregate() function a bit but always with FUN=mean rather than sum . It seems to have a few idiosyncracies which are making things difficult.

I am using the World Health organisation's mortality dataset which is formatted as so:

> head(data)
  Country Admin1 SubDiv Year List Cause Sex Frmat IM_Frmat Deaths1 Deaths2 Deaths3 Deaths4 Deaths5 Deaths6 Deaths7 Deaths8 Deaths9 Deaths10 Deaths11 Deaths12 Deaths13 Deaths14 Deaths15 Deaths16 Deaths17
1    1125     NA        2000  103   V72   1     2        8       1       0       0      NA      NA      NA       0       0       0        0        0        0        1        0        0        0        0
2    1125     NA        2000  103   V77   1     2        8       1       0       0      NA      NA      NA       0       0       0        0        0        1        0        0        0        0        0
3    1125     NA        2000  103   V78   1     2        8      19       1       0      NA      NA      NA       0       2       2        0        2        2        1        3        1        1        2
4    1125     NA        2000  103   V58   1     2        8      30       0       1      NA      NA      NA       1       2       2        1        4        3        2        4        5        1        0
5    1125     NA        2000  103   V58   2     2        8       4       0       0      NA      NA      NA       1       0       0        0        0        2        1        0        0        0        0
6    1125     NA        2000  103   V74   1     2        8       5       0       0      NA      NA      NA       1       1       0        0        1        0        1        0        1        0        0
  Deaths18 Deaths19 Deaths20 Deaths21 Deaths22 Deaths23 Deaths24 Deaths25 Deaths26 IM_Deaths1 IM_Deaths2 IM_Deaths3 IM_Deaths4  Name
1        0        0        0        0        0        0       NA       NA        0          0         NA         NA         NA Egypt
2        0        0        0        0        0        0       NA       NA        0          0         NA         NA         NA Egypt
3        0        1        0        0        0        1       NA       NA        0          1         NA         NA         NA Egypt
4        2        1        1        0        0        0       NA       NA        0          0         NA         NA         NA Egypt
5        0        0        0        0        0        0       NA       NA        0          0         NA         NA         NA Egypt
6        0        0        0        0        0        0       NA       NA        0          0         NA         NA         NA Egypt

This shows the cause of each for multiple diseases (V72, V77 etc.) for 1 country (1125 == Egypt) in 1 year, for men (Sex=1) and women (Sex=2). The deaths columns are number of deaths per age group.

The thing is I want to group the number of deaths for a few disease codes, per country, per year, per sex. Specifically I want the disease codes I20* - I25*. These corresponds to the ischaemic heart disease deaths.

The first thing I've done is select only those rows:

codes = c("1067","I20","I21","I22","I23","I24","I25",paste("I",200:250,sep="")) #ICD-10 codes for IHD

data_ihd <- subset(data, Cause %in% codes)

And then I try to use aggregate to sum the rows which match for Country, Year, Sex. I no longer care about the disease code - this dataset has 1 row per disease code for each country/year/sex combination - I want to treat all the diseases as 1 and sum the deaths.

It seems like I should be able to do:

> aggregate(data_ihd, by=list(data_ihd$Name,data_ihd$Year,data_ihd$Sex),FUN=sum)
Error in Summary.factor(3518L, na.rm = FALSE) : 
  ‘sum’ not meaningful for factors

But it seems not. Which is weirdly because it's happy if I use FUN=mean .

I know it's because I have factors, but I don't see how I can lose factors without actively deleting the columns?

What would you advise?

I'm happy to use a package like dplyr or tidyr but not sure how they would help here...

EDIT: I guess I also need to be careful here that I only sum the Deaths* columns rather than the Country, Year columns! I'm not sure if aggregate makes sure it doesn't sum the columns you supply in by=

EDIT: I've been asked to give a bit more info about what I want. If we take the following dataset:

> head(data)
  Country Admin1 SubDiv Year List Cause Sex Frmat IM_Frmat Deaths1 Deaths2 Deaths3 Deaths4 Deaths5 Deaths6 Deaths7 Deaths8 Deaths9 Deaths10 Deaths11 Deaths12 Deaths13 Deaths14 Deaths15 Deaths16 Deaths17
1    1000     NA        2000  103   1   1     2        8       1       0       0      NA      NA      NA       0       0       0        0        0        0        1        0        0        0        0
2    1000     NA        2000  103   1   2     2        8       1       0       0      NA      NA      NA       0       0       0        0        0        1        0        0        0        0        0
3    1000     NA        2000  103   2   1     2        8      19       1       0      NA      NA      NA       0       2       2        0        2        2        1        3        1        1        2
4    1000     NA        2000  103   2   2     2        8      30       0       1      NA      NA      NA       1       2       2        1        4        3        2        4        5        1        0
5    1000     NA        2001  103   1   1     2        8       4       0       0      NA      NA      NA       1       0       0        0        0        2        1        0        0        0        0
6    1000     NA        2001  103   1   2     2        8       5       0       0      NA      NA      NA       1       1       0        0        1        0        1        0        1        0        0
7    1000     NA        2001  103   2   1     2        8       4       0       0      NA      NA      NA       1       0       0        0        0        2        1        0        0        0        0
8    1000     NA        2001  103   2   2     2        8       5       0       0      NA      NA      NA       1       1       0        0        1        0        1        0        1        0        0
9    2000     NA        2000  103   1   1     2        8       4       0       0      NA      NA      NA       1       0       0        0        0        2        1        0        0        0        0
10   2000     NA        2000  103   1   2     2        8       5       0       0      NA      NA      NA       1       1       0        0        1        0        1        0        1        0        0

Becomes

  Country Admin1 SubDiv Year List Sex Frmat IM_Frmat Deaths1 Deaths2 Deaths3 Deaths4 Deaths5 Deaths6 Deaths7 Deaths8 Deaths9 Deaths10 Deaths11 Deaths12 Deaths13 Deaths14 Deaths15 Deaths16 Deaths17
1    1000     NA        2000  103   1     2        8      20       1       0      NA      NA      NA       0       2       2        0        2        2        3        3        1        1        2
2    1000     NA        2000  103   2     2        8      31       0       1      NA      NA      NA       1       2       2        1        4        4        2        4        5        1        0
...
.    2000     NA        2000 ...

As you can see, rows 1 and 3 have been summed for the Deaths columns because the Country, Year and Sex are equal. The cause is ignored. The matching columns have not been summed, of course.

EDIT 3: Let's simplyify this a lot, then:

  Country Year   Sex Cause  Deaths1 Deaths2 Deaths3
1    UK    2000   1    A    1       1       1
2    UK    2000   2    A    1       1       1
3    UK    2000   1    B    1       1       1
4    UK    2000   2    B    1       1       1
5    UK    2001   1    A    1       1       1
6    UK    2001   2    A    1       1       1
7    UK    2001   1    B    1       1       1
8    UK    2001   2    B    1       1       1
1    USA   2000   1    A    1       1       1
2    USA   2000   2    A    1       1       1
3    USA   2000   1    B    1       1       1
4    USA   2000   2    B    1       1       1
5    USA   2001   1    A    1       1       1
6    USA   2001   2    A    1       1       1
7    USA   2001   1    B    1       1       1
8    USA   2001   2    B    1       1       1
...

Group all the causes together (sum the deaths), when County, Year and Sex are the same, becoming:

  Country Year   Sex Deaths1 Deaths2 Deaths3 
1    UK    2000   1    2       2       2
2    UK    2000   2    2       2       2
3    UK    2001   1    2       2       2
4    UK    2001   2    2       2       2
5    USA   2000   1    2       2       2
6    USA   2000   2    2       2       2
7    USA   2001   1    2       2       2
8    USA   2001   2    2       2       2

I can't seem to use aggregate here because 1) Country is a factor; 2) it would sum the years

EDIT: given what you are saying above, this should get you there:

ok so you want to know how many people died per country, year and their sex, ignoring the cause of the death:

require(dplyr, magrittr)

group_by(data, Country, Year, Sex) %>% # here you specify which variable you wanna group by
summarise("deathsum.I20" = sum(I20), #now the sum gets calculated for each # level of the grouping variable
"deathsum.1" = sum(Deaths1),
    "deathsum.2" = sum(Deaths2),
    "deathsum.3" = sum(Deaths3),
    "deathsum.4" = sum(Deaths4)))

the same logic applies for the rest of the Death categories, this ignores the causes, which is what you want,

If you also wanna ignore the age and just sum all the death-groups you could do this before running the rest of the above code:

mutate(data, deathsum = rowsum(select(data, Deaths1:Deaths25))

this gives you a new variable in the dataframe data named deathsum which just sums all the deaths for the different ages

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM