
Combining two unequal datasets to calculate proportion

I know there are similar questions about calculating the proportion of each group, but they all work within a single dataset. I have two datasets. The first contains the user ID, the date, and the total daily duration of phone app use. The second contains the same user ID and date, but with the daily duration broken down by app category (so summing the categories per user and day gives the totals in the first dataset).

dput for dataset 1:

dat_1 <- structure(list(user_id = c(10161L, 10161L, 10161L, 10161L, 10161L, 
10161L, 10161L, 10161L, 10161L, 10161L, 10161L, 10161L, 10161L, 
10161L, 10161L, 10161L, 10161L, 10161L, 10161L, 10161L), date = structure(c(17948, 
17949, 17950, 17951, 17952, 17953, 17954, 17955, 17956, 17957, 
17958, 17959, 17960, 17961, 17962, 17963, 17964, 17965, 17966, 
17967), class = "Date"), duration = structure(c(5212.71700000763, 
20655.6629965305, 14162.9649987221, 18286.7030012608, 15315.1349999905, 
17845.9039983749, 15864.4930007458, 14331.2430002689, 16331.9680001736, 
18098.3090002537, 20003.6570017338, 15547.8630020618, 18242.8340024948, 
24890.6929991245, 24226.1790001392, 26849.5739989281, 21208.1910011768, 
20396.9730014801, 24253.9579980373, 20673.4809997082), class = "difftime", units = "secs")), row.names = c(NA, 
-20L), class = c("grouped_df", "tbl_df", "tbl", "data.frame"), vars = "user_id", drop = TRUE, indices = list(
    0:19), group_sizes = 20L, biggest_group_size = 20L, labels = structure(list(
    user_id = 10161L), row.names = c(NA, -1L), class = "data.frame", vars = "user_id", drop = TRUE))

dput for dataset 2:

dat_2 <- structure(list(user_id = c(10161L, 10161L, 10161L, 10161L, 10161L, 
10161L, 10161L, 10161L, 10161L, 10161L, 10161L, 10161L, 10161L, 
10161L, 10161L, 10161L, 10161L, 10161L, 10161L, 10161L), date = structure(c(17948, 
17948, 17948, 17948, 17949, 17949, 17949, 17949, 17949, 17950, 
17950, 17950, 17950, 17951, 17951, 17951, 17951, 17952, 17952, 
17952), class = "Date"), categories = structure(c(1L, 2L, 3L, 
6L, 1L, 2L, 3L, 5L, 6L, 1L, 2L, 3L, 6L, 1L, 2L, 3L, 6L, 1L, 2L, 
3L), .Label = c("communication", "games & entertainment", "lifestyle", 
"news & information outlet", "social network", "utility & tools"
), class = "factor"), cat_duration = structure(c(1770.70500040054, 
1855.2380001545, 38.9109997749329, 1547.86299967766, 7010.0589993, 
10680.9569990635, 71.5590000152588, 741.676999807358, 2151.41099834442, 
5154.79599928856, 5501.70999979973, 116.311000108719, 3390.14799952507, 
12149.4220018387, 5009.53099989891, 371.340999603271, 756.408999919891, 
5633.53999876976, 8119.65800046921, 347.116999864578), class = "difftime", units = "secs")), row.names = c(NA, 
-20L), class = c("grouped_df", "tbl_df", "tbl", "data.frame"), vars = c("user_id", 
"date"), drop = TRUE, indices = list(0:3, 4:8, 9:12, 13:16, 17:19), group_sizes = c(4L, 
5L, 4L, 4L, 3L), biggest_group_size = 5L, labels = structure(list(
    user_id = c(10161L, 10161L, 10161L, 10161L, 10161L), date = structure(c(17948, 
    17949, 17950, 17951, 17952), class = "Date")), row.names = c(NA, 
-5L), class = "data.frame", vars = c("user_id", "date"), drop = TRUE))

I would like to add a new column to the second dataset that shows each category's share of that day's total duration. It should look like this (the proportion values are only placeholders):

     user_id date       categories            cat_duration     proportion 
     <int> <date>     <fct>                 <time>        
 1   10161 2019-02-21 communication          1770.705 secs       20%
 2   10161 2019-02-21 games & entertainment  1855.238 secs       21%
 3   10161 2019-02-21 lifestyle                38.911 secs       0.2%
 4   10161 2019-02-21 utility & tools        1547.863 secs       2%
 5   10161 2019-02-22 communication          7010.059 secs       14%
 6   10161 2019-02-22 games & entertainment 10680.957 secs       22%

However, I tried the following (category_duration is the second dataset, daily_duration the first), which I already expected to fail because the two columns have different lengths:

category_duration$proportion <- (category_duration$cat_duration / daily_duration$duration)

The second argument is also a problem in itself, because it is a difftime object. The error was: 'second argument of / cannot be a "difftime" object'. Thank you in advance for your help!

I would approach it as follows: join the daily duration to the category duration, convert the difftime objects to numbers, and divide one by the other.

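library(dplyr)
category_duration %>%
  # attach each day's total duration to every category row of the same user and date
  left_join(daily_duration, by = c("user_id", "date")) %>%
  # convert both difftime columns to plain seconds before dividing
  mutate(cat_duration_proportion = as.numeric(cat_duration, units = "secs") / as.numeric(duration, units = "secs"))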
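
For a quick check of the join-and-divide approach, here is a minimal sketch that rebuilds only the first day (2019-02-21) from the question's data, with durations rounded to three decimals. The tibble() construction, the character categories column, and the extra percent column with its "%.1f%%" format are assumptions added for illustration and are not part of the original answer.

```r
library(dplyr)

# First day of the question's data, rounded to three decimals
daily_duration <- tibble(
  user_id  = 10161L,
  date     = as.Date("2019-02-21"),
  duration = as.difftime(5212.717, units = "secs")
)

category_duration <- tibble(
  user_id      = 10161L,
  date         = as.Date("2019-02-21"),
  categories   = c("communication", "games & entertainment",
                   "lifestyle", "utility & tools"),
  cat_duration = as.difftime(c(1770.705, 1855.238, 38.911, 1547.863),
                             units = "secs")
)

result <- category_duration %>%
  # every category row picks up the matching daily total
  left_join(daily_duration, by = c("user_id", "date")) %>%
  mutate(
    # numeric share of the day, between 0 and 1
    proportion = as.numeric(cat_duration, units = "secs") /
      as.numeric(duration, units = "secs"),
    # hypothetical display column, formatted as a percentage string
    percent = sprintf("%.1f%%", 100 * proportion)
  )

result
```

Because the category durations of a day add up to that day's total, the proportion values within each user and date should sum to 1, which makes a convenient sanity check on the join.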

Your columns cat_duration and duration are not plain numbers but of type difftime. That is a data type for time differences, and it consists not only of a number but also of a unit.
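
To illustrate the number-plus-unit point, here is a small base R sketch. The two durations are made-up values, not taken from the question's data. Passing the same units to as.numeric() on both sides keeps the ratio correct even when the two difftime objects are stored in different units.

```r
# Two illustrative durations stored in different units (made-up values)
part  <- as.difftime(30, units = "mins")
whole <- as.difftime(2, units = "hours")

units(part)                       # "mins"
as.numeric(part, units = "secs")  # 1800

# Convert both to the same unit before dividing
ratio <- as.numeric(part, units = "secs") / as.numeric(whole, units = "secs")
ratio                             # 0.25
```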

Does this answer help you? Divide two difftime objects


 