I have a tibble of 65524 observations where one variable is an ID for an household and the other is factor where the value of 1
is assigned if the age of the person in the household is less then 15y.o., 2
is assigned if the age is between 15 and 64, and 3
is assigned if the age of the person is 65 or older. The tibble look like this
> head(df, 15)
# A tibble: 15 x 2
hh.id age.cat
<dbl> <dbl+lbl>
1 11009 2
2 11009 2
3 11009 2
4 11009 2
5 11009 2
6 11009 1
7 11009 1
8 11009 1
9 11018 2
10 11018 1
11 11018 1
12 11018 1
13 11018 1
14 11018 2
15 11018 2
I need to create a variable to estimate the dependency ratio of each household. Something similar to this
> head(df, 15)
# A tibble: 15 x 3
hh.id age.cat dep.ratio
<dbl> <dbl+lbl><dbl>
1 11009 2 0.60
2 11009 2 0.60
3 11009 2 0.60
4 11009 2 0.60
5 11009 2 0.60
6 11009 1 0.60
7 11009 1 0.60
8 11009 1 0.60
9 11018 2 1.25
10 11018 1 1.25
11 11018 1 1.25
12 11018 1 1.25
13 11018 1 1.25
14 11018 2 1.25
15 11018 2 1.25
I thought that using dplyr::mutate
and dplyr::group_by
would work
df <- df %>%
dplyr::group_by(hh.id) %>%
dplyr::mutate(dep.ratio = (length(which(df$age.cat == 1)) + length(which(df$age.cat == 3)))/length(which(df$age.cat == 2)))
However, I do not get the estimates per each group (ie per each household), but I get the overall dependency ratio for the whole sample, repeated for each observation.
# A tibble: 15 x 3
# Groups: hh.id [2]
hh.id age.cat dep.ratio
<dbl> <dbl+lbl> <dbl>
1 11009 2 1.02
2 11009 2 1.02
3 11009 2 1.02
4 11009 2 1.02
5 11009 2 1.02
6 11009 1 1.02
7 11009 1 1.02
8 11009 1 1.02
9 11018 2 1.02
10 11018 1 1.02
11 11018 1 1.02
12 11018 1 1.02
13 11018 1 1.02
14 11018 2 1.02
15 11018 2 1.02
I then considered using tapply
, but I could not write a function which conditions on the values of hh.id
. Finally, I also tried aggregate
, but without any luck.
Any suggestion is welcome.
Thanks
Manolo
Here's an option:
ratiodf<- df %>% group_by(hh.id,age.cat) %>%
summarize(n=n()) %>%
spread(age.cat,n) %>%
mutate(ratio=(`1`+`3`)/`2`)
This will give you something like this:
# A tibble: 2 x 4
# Groups: hh.id [2]
hh.id `1` `2` ratio
<int> <int> <int> <dbl>
1 11009 3 5 0.6
2 11018 4 3 1.33
If you need to keep the data from the original df (eg, other columns), you can continue with a left_join:
left_join(df, ratiodf[,c(-2:-3)], by="hh.id")
Which will result in this:
hh.id age.cat ratio
1 11009 2 0.600000
2 11009 2 0.600000
3 11009 2 0.600000
4 11009 2 0.600000
5 11009 2 0.600000
6 11009 1 0.600000
7 11009 1 0.600000
8 11009 1 0.600000
9 11018 2 1.333333
10 11018 1 1.333333
11 11018 1 1.333333
12 11018 1 1.333333
13 11018 1 1.333333
14 11018 2 1.333333
15 11018 2 1.333333
The reason your code didn't work is that as soon as your put in the base format of df$hh.id
etc., you bypass the tidyverse grouping and get the full column.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.