简体   繁体   中英

R studio: Decompose / pivot a dataframe to calculate the mean value for each variable at various time segments for each group

I have a dataframe that looks like this:

It contains the daily data for var1, var2, var3 for city A,B,C from 2020-01-01 to 2020-12-31.

+-----------+------------+------+------+------+
| Geography | Dates      | var1 | var2 | var3 |
+-----------+------------+------+------+------+
| A         | 2020-01-01 | 10   | 100  | 1    |
+-----------+------------+------+------+------+
| A         | 2020-01-02 | 20   | 200  | 2    |
+-----------+------------+------+------+------+
| A         | 2020-01-03 | 30   | 300  | 3    |
+-----------+------------+------+------+------+
| A         | ...        | ...  | ...  | ...  |
+-----------+------------+------+------+------+
| A         | 2020-12-31 | 50   | 500  | 5    |
+-----------+------------+------+------+------+
| B         | 2020-01-01 | 60   | 600  | 6    |
+-----------+------------+------+------+------+
| B         | 2020-01-02 | 70   | 700  | 7    |
+-----------+------------+------+------+------+
| B         | 2020-01-03 | 50   | 500  | 5    |
+-----------+------------+------+------+------+
| B         | ...        | ...  | ...  | ...  |
+-----------+------------+------+------+------+
| B         | 2020-12-31 | 60   | 600  | 6    |
+-----------+------------+------+------+------+
| C         | 2020-01-01 | 30   | 300  | 3    |
+-----------+------------+------+------+------+
| C         | 2020-01-02 | 40   | 400  | 4    |
+-----------+------------+------+------+------+
| C         | 2020-01-03 | 50   | 500  | 5    |
+-----------+------------+------+------+------+
| C         | ...        | ...  | ...  | ...  |
+-----------+------------+------+------+------+
| C         | 2020-12-31 | 70   | 700  | 7    |
+-----------+------------+------+------+------+

I would like to divide this data into 3 time segments:

  • (1) 2020-01-01 to 2020-05-30,
  • (2) 2020-06-01 to 2020-08-31,
  • (3) 2020-09-01 to 2020-12-31

And calculate the average for each variable , for each of those time segments , for each city , respectively.

The ending dataframe would like this:

+-----------+---------------+---------------+---------------+---------------+---------------+---------------+---------------+---------------+---------------+
| Geography | var1_seg1_avg | var1_seg2_avg | var1_seg3_avg | var2_seg1_avg | var2_seg2_avg | var2_seg3_avg | var3_seg1_avg | var3_seg2_avg | var3_seg3_avg |
+-----------+---------------+---------------+---------------+---------------+---------------+---------------+---------------+---------------+---------------+
| A         | xx            | xx            | xx            | xx            | xx            | xx            | xx            | xx            | xx            |
+-----------+---------------+---------------+---------------+---------------+---------------+---------------+---------------+---------------+---------------+
| B         | xx            | xx            | xx            | xx            | xx            | xx            | xx            | xx            | xx            |
+-----------+---------------+---------------+---------------+---------------+---------------+---------------+---------------+---------------+---------------+
| C         | xx            | xx            | xx            | xx            | xx            | xx            | xx            | xx            | xx            |
+-----------+---------------+---------------+---------------+---------------+---------------+---------------+---------------+---------------+---------------+

What is an elegant way to do this using tidyverse / dplyr?

Much appreciation for your help!

You can also try:

library(dplyr)
library(tidyr)
#Code
new <- df %>%
  #Create group
  mutate(Dates=as.Date(Dates),
         Group=ifelse(Dates>=as.Date('2020-01-01') & Dates<=as.Date('2020-05-30'),'Seg1',
                      ifelse(Dates>=as.Date('2020-06-01') & Dates<=as.Date('2020-08-31'),'Seg2',
                             ifelse(Dates>=as.Date('2020-09-01') & Dates<=as.Date('2020-12-31'),'Seg3',NA)))) %>%
  group_by(Geography,Group) %>%
  select(-Dates) %>%
  summarise_all(mean,na.rm=T) %>%
  pivot_wider(names_from = Group,values_from=c(var1:var3))

Output:

# A tibble: 3 x 7
# Groups:   Geography [3]
  Geography var1_Seg1 var1_Seg3 var2_Seg1 var2_Seg3 var3_Seg1 var3_Seg3
  <chr>         <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
1 A                20        50       200       500         2         5
2 B                60        60       600       600         6         6
3 C                40        70       400       700         4         7

Althought, @akrun solution is more efficient in terms of creating the groups and avoid conditionals.

Some data used:

#Data
df <- structure(list(Geography = c("A", "A", "A", "A", "B", "B", "B", 
"B", "C", "C", "C", "C"), Dates = c(" 2020-01-01 ", " 2020-01-02 ", 
" 2020-01-03 ", " 2020-12-31 ", " 2020-01-01 ", " 2020-01-02 ", 
" 2020-01-03 ", " 2020-12-31 ", " 2020-01-01 ", " 2020-01-02 ", 
" 2020-01-03 ", " 2020-12-31 "), var1 = c(10L, 20L, 30L, 50L, 
60L, 70L, 50L, 60L, 30L, 40L, 50L, 70L), var2 = c(100L, 200L, 
300L, 500L, 600L, 700L, 500L, 600L, 300L, 400L, 500L, 700L), 
    var3 = c(1L, 2L, 3L, 5L, 6L, 7L, 5L, 6L, 3L, 4L, 5L, 7L)), row.names = c(2L, 
4L, 6L, 9L, 11L, 13L, 15L, 18L, 20L, 22L, 24L, 27L), class = "data.frame")

An option would be to cut the 'Dates' to create a new column, do a group by mean and pivot to wide format

library(dplyr)
library(tidyr)
df1 %>%
   group_by(Geography, 
     seg = paste0("seg", as.integer(cut(as.Date(Dates),
     breaks = c("2020-01-01", "2020-06-01", "2020-09-01", "2020-12-31")))) %>%
   summarise(across(starts_with('var'), mean, na.rm = TRUE), .groups = 'drop') %>%
  pivot_wider(names_from = seg, values_from = starts_with('var'))

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM