简体   繁体   中英

R Rearrange columns in dataframe based on date values in column names

I've got a dataframe that has monthly survey scores for a certain hospitals. Each month, we store the score obtained by the hospital ( _Score column) and the corresponding average score for all hospitals for that month ( _Average column).

Here's a short sample of what it looks like -

df = data.frame(Hospital=c(rep("Hospital A",10),rep("Hospital B",10),rep("Hospital C",10),rep("Hospital D",10)),
                Question=c(rep("Q1",40)),
                key=c(rep(c("2020-01-31_Average","2020-01-31_Score","2020-02-29_Average","2020-02-29_Score",
                      "2020-03-31_Average","2020-03-31_Score","2020-04-30_Average","2020-04-30_Score",
                      "2020-05-31_Average","2020-05-31_Score"),4)),
                value=c(round(runif(40,0,1),2)))

library(tidyr)
df = df %>% spread(key,value)

I would like to transform this dataframe such that -

1) The first two columns, Hospital and Question remain the same

2) _Score columns for the three most recent months only are kept

3) _Average column for the most recent month is kept

4) Ideally, the columns need to be reordered from oldest to most recent (ie in the following order: Month M-2_Score , Month M-1_Score , Month M_Score , Month M_Average )

5) Calculate a column Variance at the end, which is the difference between Score M and Score M-1

What I'm trying to achieve

Using dplyr, this can be done manually by reordering the columns. But I'm looking for a way to build a logic that automatically reorders columns for the 3 most recent months in the sequence described above. By taking the date values embedded in the column names and reordering according to them.

The resulting table would look like this -

#Final table
df_transformed = df %>%
  select(1:2,8,10,12,11) %>%
  mutate(Variance=.[[5]]-.[[4]])

Any tips on how to do this more efficiently using the date values in the column names would be highly appreciated.

This is a possible solution if the columns in your dataset are already ordered chronologically

# create vectors of variables: 3 last "_Score" and 1 last "_Average"
score_vars <- tail(names(df)[grep("_Score", names(df))], 3)
average_var <- tail(names(df)[grep("_Average", names(df))], 1)

df %>% 
  select(Hospital, Question, !!score_vars, !!average_var) %>% 
  mutate(Variance = !!rlang::sym(score_vars[3]) - !!rlang::sym(score_vars[2]))

Output

# Hospital Question 2020-03-31_Score 2020-04-30_Score 2020-05-31_Score 2020-05-31_Average Variance
# 1 Hospital A       Q1             0.28             0.69             0.31               0.94    -0.38
# 2 Hospital B       Q1             0.19             0.41             0.27               0.91    -0.14
# 3 Hospital C       Q1             0.53             0.03             0.25               0.05     0.22
# 4 Hospital D       Q1             0.43             0.59             0.46               0.36    -0.13

I don't really get questions 4 and 5, but they feel a little like "Can you do my homework, please?". For questions 1 to 3, consider this:

library(tidyverse)
library(lubridate)

df <- data.frame(Hospital=c(rep("Hospital A",10),rep("Hospital B",10),rep("Hospital C",10),rep("Hospital D",10)),
                Question=c(rep("Q1",40)),
                key=c(rep(c("2020-01-31_Average","2020-01-31_Score","2020-02-29_Average","2020-02-29_Score",
                      "2020-03-31_Average","2020-03-31_Score","2020-04-30_Average","2020-04-30_Score",
                      "2020-05-31_Average","2020-05-31_Score"),4)),
                value=c(round(runif(40,0,1),2)))

# take the dataframe
df %>%
    # get month as a date and key separately
    mutate(month = str_replace(key, "_[[:alpha:]]*$", "") %>% ymd()
           , key = str_extract(key, "[[:alpha:]]*$")) %>%
    # filter Score for the last 3 and Average for the last 1 months
    filter(!(today() - month > months(3) & 
                 key == "Score")) %>%
    filter(!(today() - month > months(1) &
                 key == "Average"))

I have used your original df in long format before the spread step.

library(dplyr)
library(tidyr)

df %>%
  #Bring date and key in separate columns
  separate(key, c('Date', 'key'), sep = '_') %>%
  #Convert date column to date class
  mutate(Date = as.Date(Date)) %>%
  #arrange data according with highest date first
  arrange(Hospital, key, desc(Date)) %>%
  #For each hospital and key
  group_by(Hospital, key) %>%
  #If it is a "score" column select top 3 values and 
  #for average column select only 1 value
  slice(if(first(key) == 'Score') 1:3 else 1) %>%
  select(-Question) %>%
  ungroup() %>%
  #Get the data in wide format
  pivot_wider(names_from = c(key, Date), values_from = value) %>%
  #Calculate variance column
  mutate(Variance = .[[3]] - .[[4]])

# A tibble: 4 x 6
#  Hospital   `Average_2020-05-31` `Score_2020-05-31` `Score_2020-04-30` `Score_2020-03-31` Variance
#  <chr>                     <dbl>              <dbl>              <dbl>              <dbl>    <dbl>
#1 Hospital A                 0.45               0.44               0.66               0.97    -0.22
#2 Hospital B                 0.11               0.53               0.68               0.27    -0.15
#3 Hospital C                 1                  0.18               0.56               0.41    -0.38
#4 Hospital D                 0.31               0.83               0.6                0.79     0.23

To calculate variance .[[3]] -.[[4]] will be fixed because "Hospital" column is fixed and would always be 1st column. "Average" column would come before "Score" column (alphabetically) and since the data is sorted by Date , we know that highest date would be placed first, then second-highest and so on.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM