Get ratio of levels of a factor per another factor

I would like to calculate the ratio of the levels of status (inactive/active) for each production_year:

df[1:150, c("production_year", "status")]
     production_year   status
4               2010 inactive
12              2008 inactive
13              2011 inactive
15              2007 inactive
23              2008 inactive
24              2011 inactive
26              2006 inactive
27              2011 inactive
29              2006 inactive
32              2007 inactive
34              2008 inactive
36              2006 inactive
41              2011 inactive
45              2011 inactive
48              2007 inactive
53              2007 inactive
56              2006 inactive
57              2006 inactive
59              2011 inactive
61              2008 inactive
63              2010 inactive
65              2010 inactive
NA              <NA>     <NA>
NA.1            <NA>     <NA>
72              2012 inactive
76              2013 inactive
78              2012 inactive
NA.2            <NA>     <NA>
100             2010 inactive
102             2012 inactive
104             2013 inactive
NA.3            <NA>     <NA>
111             2011 inactive
112             2011 inactive
114             2009 inactive
115             2012 inactive
117             2012 inactive
121             2009 inactive
124             2016 inactive
125             2017 inactive
127             2013 inactive
130             2017 inactive
133             2015 inactive
134             2016 inactive
135             2017 inactive
136             2018 inactive
139             2014 inactive
140             2015 inactive
141             2015   active
143             2011 inactive
144             2011 inactive
146             2011 inactive
147             2012 inactive
148             2013 inactive
149             2013 inactive
150             2014 inactive
153             2016 inactive
154             2017 inactive
155             2017 inactive
158             2012 inactive
159             2013 inactive
160             2014 inactive
162             2012 inactive
164             2013 inactive
165             2014 inactive
166             2015 inactive
167             2010 inactive
169             2014 inactive
170             2015   active
171             2015 inactive
173             2009 inactive
174             2011 inactive
175             2011 inactive
176             2012 inactive
177             2013 inactive
181             2017 inactive
183             2010 inactive
186             2011 inactive
187             2011 inactive
189             2011 inactive
191             2012 inactive
192             2013 inactive
193             2013 inactive
194             2014 inactive
196             2013 inactive
197             2014 inactive
199             2017   active
201             2013 inactive
203             2013 inactive
205             2013 inactive
206             2014 inactive
209             2017 inactive
211             2013 inactive
....

I know that I have to count() the inactive units and divide them by the count() of the active units per year. However, I don't know how to do this in a smart (fast) way. In the end, I would like to have something like:

production_year ratio
2010 0.75
2011 0.78
2012 0.79
..

Edit: I got it working, though it's quite dirty:

df_ratio<- df %>% group_by(production_year, status) %>% tally
df_ratio<- df_ratio %>%
  spread(production_year, n)
df_ratio<- as.data.frame(df_ratio)
df_ratio<- df_ratio[-c(3), ]
df_ratio<- df_ratio %>% select(where(~!any(is.na(.))))
df_ratio[3, 2:ncol(df_ratio)]<- df_ratio[2, 2:ncol(df_ratio)] / df_ratio[1, 2:ncol(df_ratio)]

> df_ratio
    status          ? 2003 2004  2005   2006      2007      2008   2009  2010       2011       2012
1   active 36.0000000    1    1   5.0   4.00  18.00000  33.00000  25.00  12.0  74.000000 109.000000
2 inactive 17.0000000   28   40 118.0 207.00 257.00000 368.00000 328.00 318.0 681.000000 444.000000
3     <NA>  0.4722222   28   40  23.6  51.75  14.27778  11.15152  13.12  26.5   9.202703   4.073394
        2013       2014 2015  2016     2017
1  74.000000  56.000000    5   2.0  3.00000
2 410.000000 146.000000  145 167.0 95.00000
3   5.540541   2.607143   29  83.5 31.66667

After counting, you can use pivot_wider so you end up with a column of active and a column of inactive. You can then easily calculate the ratio. (Of course, you can't calculate a ratio if the denominator is NA or 0.)

library(dplyr)
library(tidyr)

df <- structure(list(production_year = c(2017, 2017, 2015, 2017, 2015, 
2015, 2017, 2017, 2015, 2015, 2015, 2017, 2017, 2017), status = c("inactive", 
"inactive", "inactive", "inactive", "inactive", "active", "inactive", 
"inactive", "inactive", "active", "inactive", "inactive", "active", 
"inactive")), row.names = c(NA, -14L), class = c("tbl_df", "tbl", 
"data.frame"))

# Count rows per year and status, spread the counts into one column per
# status, then divide active by inactive.
df |> 
  count(production_year, status) |> 
  pivot_wider(names_from = status, values_from = n) |> 
  mutate(ratio = round(active / inactive, 2))

#> # A tibble: 2 x 4
#>   production_year active inactive ratio
#>             <dbl>  <int>    <int> <dbl>
#> 1            2015      2        4  0.5 
#> 2            2017      1        7  0.14

Created on 2022-04-22 by the reprex package (v2.0.1)
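
The caveat about a missing or zero denominator can be handled explicitly. The following is a minimal sketch, not part of the original answer: the toy data `df_demo` is made up for illustration, with one year (2016) that has no inactive rows and one row of NAs like those in the question's data. It assumes the same column names as above. Swap numerator and denominator if inactive/active is the ratio you want.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: 2016 has no "inactive" rows, and one row is all NA.
df_demo <- tibble(
  production_year = c(2015, 2015, 2015, 2016, 2016, NA),
  status = c("active", "inactive", "inactive", "active", "active", NA)
)

ratios <- df_demo |>
  # Drop rows where either the year or the status is missing.
  filter(!is.na(production_year), !is.na(status)) |>
  count(production_year, status) |>
  # values_fill = 0 turns an absent year/status combination into a count of 0.
  pivot_wider(names_from = status, values_from = n, values_fill = 0) |>
  # Return NA instead of Inf when the denominator is 0.
  mutate(ratio = if_else(inactive == 0, NA_real_, active / inactive))

ratios
```

Without `values_fill`, the missing combination would be NA rather than 0, and the ratio for that year would be NA as well; filling with 0 just makes the reason explicit.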

df <- structure(list(production_year = c(2017, 2017, 2015, 2017, 2015, 
2015, 2017, 2017, 2015, 2015, 2015, 2017, 2017, 2017), status = c("inactive", 
"inactive", "inactive", "inactive", "inactive", "active", "inactive", 
"inactive", "inactive", "active", "inactive", "inactive", "active", 
"inactive")), row.names = c(NA, -14L), class = c("tbl_df", "tbl", 
"data.frame"))

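library(tidyverse)
# values_fn = length counts the rows per year/status cell directly,
# so no separate count() step is needed.
df %>% 
  pivot_wider(
    id_cols = production_year,
    names_from = status,
    values_from = status,
    values_fn = length
  ) %>% 
  mutate(ratio = active / inactive)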
#> # A tibble: 2 x 4
#>   production_year inactive active ratio
#>             <dbl>    <int>  <int> <dbl>
#> 1            2017        7      1 0.143
#> 2            2015        4      2 0.5


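library(magrittr)
library(data.table)

setDT(df)
# Count rows per year/status cell explicitly instead of relying on
# dcast's default aggregation.
dcast(df, production_year ~ status, fun.aggregate = length, value.var = "status") %>% 
  .[, ratio := active / inactive] %>% 
  .[]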

#>    production_year active inactive     ratio
#> 1:            2015      2        4 0.5000000
#> 2:            2017      1        7 0.1428571

Created on 2022-04-22 by the reprex package (v2.0.1)
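
For comparison, the same count-then-divide idea can be sketched in base R without extra packages. This is an illustrative alternative rather than one of the original answers; the names `df_base`, `counts` and `result` are made up here, and the data are the same 14 rows used in the answers above.

```r
# Same toy data as in the answers above, as a plain data frame.
df_base <- data.frame(
  production_year = c(2017, 2017, 2015, 2017, 2015, 2015, 2017,
                      2017, 2015, 2015, 2015, 2017, 2017, 2017),
  status = c("inactive", "inactive", "inactive", "inactive", "inactive",
             "active", "inactive", "inactive", "inactive", "active",
             "inactive", "inactive", "active", "inactive")
)

# table() cross-tabulates year by status; rows with NA are dropped by default.
counts <- table(df_base$production_year, df_base$status)

# Turn the contingency table into one row per year with a ratio column.
result <- data.frame(
  production_year = as.numeric(rownames(counts)),
  active = as.vector(counts[, "active"]),
  inactive = as.vector(counts[, "inactive"])
)
result$ratio <- result$active / result$inactive
result
```

Because `table()` fills absent combinations with 0, a year without inactive rows would give `Inf` here, so the zero-denominator caveat applies to this version too.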
