I know there are similar questions about calculating the proportion of each group, but in those the data all sit in one dataset. I have two datasets: the first contains the user ID, the date, and the total daily duration of phone app use; the second contains the same user ID and date, but with the daily duration per app category (so summing the second dataset by user and day gives the totals in the first).
dput for dataset 1 (referred to as daily_duration in the code further down):
dat_1 <- structure(list(user_id = c(10161L, 10161L, 10161L, 10161L, 10161L,
10161L, 10161L, 10161L, 10161L, 10161L, 10161L, 10161L, 10161L,
10161L, 10161L, 10161L, 10161L, 10161L, 10161L, 10161L), date = structure(c(17948,
17949, 17950, 17951, 17952, 17953, 17954, 17955, 17956, 17957,
17958, 17959, 17960, 17961, 17962, 17963, 17964, 17965, 17966,
17967), class = "Date"), duration = structure(c(5212.71700000763,
20655.6629965305, 14162.9649987221, 18286.7030012608, 15315.1349999905,
17845.9039983749, 15864.4930007458, 14331.2430002689, 16331.9680001736,
18098.3090002537, 20003.6570017338, 15547.8630020618, 18242.8340024948,
24890.6929991245, 24226.1790001392, 26849.5739989281, 21208.1910011768,
20396.9730014801, 24253.9579980373, 20673.4809997082), class = "difftime", units = "secs")), row.names = c(NA,
-20L), class = c("grouped_df", "tbl_df", "tbl", "data.frame"), vars = "user_id", drop = TRUE, indices = list(
0:19), group_sizes = 20L, biggest_group_size = 20L, labels = structure(list(
user_id = 10161L), row.names = c(NA, -1L), class = "data.frame", vars = "user_id", drop = TRUE))
dput for dataset 2 (referred to as category_duration in the code further down):
dat_2 <- structure(list(user_id = c(10161L, 10161L, 10161L, 10161L, 10161L,
10161L, 10161L, 10161L, 10161L, 10161L, 10161L, 10161L, 10161L,
10161L, 10161L, 10161L, 10161L, 10161L, 10161L, 10161L), date = structure(c(17948,
17948, 17948, 17948, 17949, 17949, 17949, 17949, 17949, 17950,
17950, 17950, 17950, 17951, 17951, 17951, 17951, 17952, 17952,
17952), class = "Date"), categories = structure(c(1L, 2L, 3L,
6L, 1L, 2L, 3L, 5L, 6L, 1L, 2L, 3L, 6L, 1L, 2L, 3L, 6L, 1L, 2L,
3L), .Label = c("communication", "games & entertainment", "lifestyle",
"news & information outlet", "social network", "utility & tools"
), class = "factor"), cat_duration = structure(c(1770.70500040054,
1855.2380001545, 38.9109997749329, 1547.86299967766, 7010.0589993,
10680.9569990635, 71.5590000152588, 741.676999807358, 2151.41099834442,
5154.79599928856, 5501.70999979973, 116.311000108719, 3390.14799952507,
12149.4220018387, 5009.53099989891, 371.340999603271, 756.408999919891,
5633.53999876976, 8119.65800046921, 347.116999864578), class = "difftime", units = "secs")), row.names = c(NA,
-20L), class = c("grouped_df", "tbl_df", "tbl", "data.frame"), vars = c("user_id",
"date"), drop = TRUE, indices = list(0:3, 4:8, 9:12, 13:16, 17:19), group_sizes = c(4L,
5L, 4L, 4L, 3L), biggest_group_size = 5L, labels = structure(list(
user_id = c(10161L, 10161L, 10161L, 10161L, 10161L), date = structure(c(17948,
17949, 17950, 17951, 17952), class = "Date")), row.names = c(NA,
-5L), class = "data.frame", vars = c("user_id", "date"), drop = TRUE))
I would like to add a new column to the second dataset that shows each category's share of the daily total duration, like this:
  user_id date       categories            cat_duration   proportion
    <int> <date>     <fct>                 <time>
1   10161 2019-02-21 communication          1770.705 secs 20%
2   10161 2019-02-21 games & entertainment  1855.238 secs 21%
3   10161 2019-02-21 lifestyle                38.911 secs 0.2%
4   10161 2019-02-21 utility & tools        1547.863 secs 2%
5   10161 2019-02-22 communication          7010.059 secs 14%
6   10161 2019-02-22 games & entertainment 10680.957 secs 22%
However, I tried the following, which I already suspected would not work because the two columns have different lengths:
category_duration$proportion <- (category_duration$cat_duration / daily_duration$duration)
Something is also wrong with the second argument itself, since it is a time object. The error was: 'second argument of / cannot be a "difftime" object'. Thank you in advance for your help!
I would approach it as follows: join the daily duration to the category duration, convert the difftime objects to numbers, and divide one by the other.
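library(dplyr)

category_duration %>%
  # bring in each day's total as the `duration` column
  left_join(daily_duration, by = c("user_id", "date")) %>%
  # convert both difftime columns to plain seconds, then divide
  mutate(cat_duration_proportion = as.numeric(cat_duration, units = "secs") / as.numeric(duration, units = "secs"))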
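To check that the join-and-divide approach behaves as intended, here is a minimal sketch on the first day of the question's data (2019-02-21), with the durations rounded to three decimals. The two data frames are rebuilt by hand as small ungrouped tibbles, and the column names `proportion` and `proportion_pct` as well as the `sprintf()` formatting are my own additions for illustration, not something from the question.

```r
library(dplyr)

# Daily total for the first day in the question (rounded to 3 decimals)
daily_duration <- tibble(
  user_id  = 10161L,
  date     = as.Date("2019-02-21"),
  duration = as.difftime(5212.717, units = "secs")
)

# Per-category durations for the same user and day (rounded to 3 decimals)
category_duration <- tibble(
  user_id      = 10161L,
  date         = as.Date("2019-02-21"),
  categories   = c("communication", "games & entertainment",
                   "lifestyle", "utility & tools"),
  cat_duration = as.difftime(c(1770.705, 1855.238, 38.911, 1547.863),
                             units = "secs")
)

result <- category_duration %>%
  # every category row picks up the matching daily total
  left_join(daily_duration, by = c("user_id", "date")) %>%
  mutate(
    # plain numeric fraction between 0 and 1
    proportion     = as.numeric(cat_duration, units = "secs") /
                     as.numeric(duration, units = "secs"),
    # hypothetical display column: the fraction as a percentage string
    proportion_pct = sprintf("%.1f%%", 100 * proportion)
  )

result
```

Because the category durations of a day add up to that day's total, the `proportion` values within each user and day should sum to 1, which makes for a quick sanity check on the real data.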
Your columns cat_duration and duration are not plain numbers but of type difftime. That is a data type for time differences, and it consists not only of a number but also of a unit.
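The number-plus-unit nature of difftime can be seen in a small base R sketch. It uses two values from the question (rounded to three decimals); the variable names `secs`, `mins` and `proportion` are just illustrative.

```r
# Two difftime values from the question: "communication" on the first day,
# and that day's total
cat_duration <- as.difftime(1770.705, units = "secs")
duration     <- as.difftime(5212.717, units = "secs")

# The unit is stored as an attribute next to the number
units(duration)

# as.numeric() drops the difftime class; the `units` argument controls
# which unit the resulting number is expressed in
secs <- as.numeric(duration, units = "secs")
mins <- as.numeric(duration, units = "mins")

# Once both sides are plain numbers in the same unit, division is ordinary
proportion <- as.numeric(cat_duration, units = "secs") / secs
proportion
```

Passing `units = "secs"` explicitly on both sides is a defensive choice: if one column happened to be stored in minutes and the other in seconds, converting both to the same unit first keeps the ratio meaningful.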
Does this answer help you? Divide two difftime objects