How to combine data frames in R that cover the same time periods for multiple attributes, and then reshape the result from wide to long so each attribute becomes a column?

I am trying to merge my sales data and patients data in R (along with some other attributes), all of which are rolled up at the country level for the same time frame. After merging, I want to convert the result from wide format to long format, with one unique row per Country-Month.

This is what my input data looks like:

1) Sales Data

Coutry_ID   Country_Name    1/28/2018   2/28/2018   3/28/2018   4/28/2018   5/28/2018
A0001       USA               44           72         85          25          72
A0002       Germany           98           70         69          48          41
A0003       Russia            82           42         32          29          43
A0004       UK                79           83         51          48          47
A0005       France            45           75         10          13          23
A0006       India             92           85         28          13          18

2) Patients Data

Coutry_ID   Country_Name    1/28/2018   2/28/2018   3/28/2018   4/28/2018   5/28/2018
A0001       USA                7          13          22          23          13
A0002       Germany            9          10          17          25          25
A0003       Russia            24          19           6           8           5
A0004       UK                 6          8           20           1          11
A0005       France             4          9            8          10          25
A0006       India             18          21           2          13          17

And this is how I want the output to look:

Coutry_ID   Country_Name    Month       Sales   Patients
A0001       USA         1/28/2018       44      7
A0001       USA         2/28/2018       72      13
A0001       USA         3/28/2018       85      22
A0001       USA         4/28/2018       25      23
A0001       USA         5/28/2018       72      13
A0002       Germany     1/28/2018       98      9
A0002       Germany     2/28/2018       70      10
A0002       Germany     3/28/2018       69      17
A0002       Germany     4/28/2018       48      25
A0002       Germany     5/28/2018       41      25
A0003       Russia      1/28/2018       82      24
A0003       Russia      2/28/2018       42      19
A0003       Russia      3/28/2018       32      6
A0003       Russia      4/28/2018       29      8
A0003       Russia      5/28/2018       43      5
A0004       UK          1/28/2018       79      6
A0004       UK          2/28/2018       83      8
A0004       UK          3/28/2018       51      20
A0004       UK          4/28/2018       48      1
A0004       UK          5/28/2018       47      11
A0005       France      1/28/2018       45      4
A0005       France      2/28/2018       75      9
A0005       France      3/28/2018       10      8
A0005       France      4/28/2018       13      10
A0005       France      5/28/2018       23      25
A0006       India       1/28/2018       92      18
A0006       India       2/28/2018       85      21
A0006       India       3/28/2018       28      2
A0006       India       4/28/2018       13      13
A0006       India       5/28/2018       18      17

I need a little guidance on these two things:

1 - How do I convert the data from wide to long?

2 - For merging the data, I am thinking about using dplyr's left_join on all these data sets, starting from my master list of countries with ID and Name. My question is whether I should convert the data sets from wide to long format before merging, or do that after merging.

You can convert both data frames to long format first and then join them:

library(dplyr)
library(tidyr)

inner_join(
   sales %>% pivot_longer(cols = -c(Coutry_ID, Country_Name), values_to = 'Sales'),
   patients %>% pivot_longer(cols = -c(Coutry_ID, Country_Name), 
                values_to = 'Patients'), 
       by = c("Coutry_ID", "Country_Name", "name"))

# A tibble: 30 x 5
#   Coutry_ID Country_Name name      Sales Patients
#   <fct>     <fct>        <chr>     <int>    <int>
# 1 A0001     USA          1/28/2018    44        7
# 2 A0001     USA          2/28/2018    72       13
# 3 A0001     USA          3/28/2018    85       22
# 4 A0001     USA          4/28/2018    25       23
# 5 A0001     USA          5/28/2018    72       13
# 6 A0002     Germany      1/28/2018    98        9
# 7 A0002     Germany      2/28/2018    70       10
# 8 A0002     Germany      3/28/2018    69       17
# 9 A0002     Germany      4/28/2018    48       25
#10 A0002     Germany      5/28/2018    41       25
# … with 20 more rows
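
The output above keeps the default column name `name` and leaves the dates as character strings, while the question asks for a `Month` column and mentions further attributes beyond sales and patients. The following is a minimal sketch of one way to handle both, assuming every data set shares the `Coutry_ID` and `Country_Name` columns and uses `m/d/Y` column names. The helper `to_long()` and the two-country sample data are illustrative assumptions, not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Small stand-ins for the wide inputs (two countries, two months).
sales <- data.frame(
  Coutry_ID = c("A0001", "A0002"),
  Country_Name = c("USA", "Germany"),
  `1/28/2018` = c(44L, 98L),
  `2/28/2018` = c(72L, 70L),
  check.names = FALSE
)
patients <- data.frame(
  Coutry_ID = c("A0001", "A0002"),
  Country_Name = c("USA", "Germany"),
  `1/28/2018` = c(7L, 9L),
  `2/28/2018` = c(13L, 10L),
  check.names = FALSE
)

id_cols <- c("Coutry_ID", "Country_Name")

# Hypothetical helper: pivot one wide data frame to long format,
# naming the date column "Month" and the value column `value_name`.
to_long <- function(df, value_name) {
  df %>%
    pivot_longer(cols = -all_of(id_cols),
                 names_to = "Month",
                 values_to = value_name) %>%
    mutate(Month = as.Date(Month, format = "%m/%d/%Y"))
}

# A named list: each name becomes the value column of that data set.
wide_list <- list(Sales = sales, Patients = patients)
long_list <- Map(to_long, wide_list, names(wide_list))

# Join all long data frames on country and month.
result <- Reduce(function(x, y) inner_join(x, y, by = c(id_cols, "Month")),
                 long_list)
result
```

Adding another attribute then only means adding another named element to `wide_list`. Note that `inner_join` drops any country-month missing from one of the data sets; `full_join` or `left_join` can be swapped in if those rows should be kept.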

data

sales <- structure(list(Coutry_ID = structure(1:6, .Label = c("A0001", 
"A0002", "A0003", "A0004", "A0005", "A0006"), class = "factor"), 
Country_Name = structure(c(6L, 2L, 4L, 5L, 1L, 3L), .Label = c("France", 
"Germany", "India", "Russia", "UK", "USA"), class = "factor"), 
`1/28/2018` = c(44L, 98L, 82L, 79L, 45L, 92L), `2/28/2018` = c(72L, 
70L, 42L, 83L, 75L, 85L), `3/28/2018` = c(85L, 69L, 32L, 
51L, 10L, 28L), `4/28/2018` = c(25L, 48L, 29L, 48L, 13L, 
13L), `5/28/2018` = c(72L, 41L, 43L, 47L, 23L, 18L)), class = 
"data.frame", row.names = c(NA, -6L))

patients <- structure(list(Coutry_ID = structure(1:6, .Label = c("A0001", 
"A0002", "A0003", "A0004", "A0005", "A0006"), class = "factor"), 
Country_Name = structure(c(6L, 2L, 4L, 5L, 1L, 3L), .Label = c("France", 
"Germany", "India", "Russia", "UK", "USA"), class = "factor"), 
`1/28/2018` = c(7L, 9L, 24L, 6L, 4L, 18L), `2/28/2018` = c(13L, 
10L, 19L, 8L, 9L, 21L), `3/28/2018` = c(22L, 17L, 6L, 20L, 
8L, 2L), `4/28/2018` = c(23L, 25L, 8L, 1L, 10L, 13L), `5/28/2018` = c(13L, 
25L, 5L, 11L, 25L, 17L)), class = "data.frame", row.names = c(NA, -6L))

Base R (not as elegant as the above):

# Create a named list of dataframes:
df_list <- list(patients = patients, sales = sales)

# Create a vector in each with the name of the dataframe:
df_list <- mapply(cbind,  df_list, "desc" = as.character(names(df_list)),
                  SIMPLIFY = FALSE)

# Define a function to reshape the data:
reshape_ps <- function(x){

  # Names of the numeric (month) columns and of the remaining id columns:
  num_cols <- names(x)[sapply(x, is.numeric)]
  id_cols <- setdiff(names(x), num_cols)

  # reshape() stores the dates in a `time` column and the values in the
  # v.names column; setNames() then renames those two columns to "month"
  # and to the name of the dataframe (e.g. "sales"):
  tmp <- setNames(
    reshape(x,
            direction = "long",
            varying = num_cols,
            idvar = id_cols,
            v.names = "month",
            times = as.Date(num_cols, "%m/%d/%Y"),
            new.row.names = seq_len(nrow(x) * length(num_cols))),
    c(id_cols, "month", as.character(unique(x$desc))))

  # Drop the dataframe name vector:
  clean <- tmp[, names(tmp) != "desc"]

  # Specify the return object:
  return(clean)
}

# Merge the result of the function applied on both dataframes:
Reduce(function(y, z){merge(y, z, by = intersect(colnames(y), colnames(z)), all = TRUE)},
                            Map(function(x){reshape_ps(x)}, df_list))
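
For comparison, the same wide-to-long step can be written in base R without `reshape()`, by repeating the id rows once per month column and stacking the values. This is only a sketch under the assumption that every non-id column is a month column; the helper `to_long_base()` and the two-country sample data are illustrative, not part of the answer above.

```r
# Small stand-ins for the wide inputs (two countries, two months).
sales <- data.frame(
  Coutry_ID = c("A0001", "A0002"),
  Country_Name = c("USA", "Germany"),
  `1/28/2018` = c(44L, 98L),
  `2/28/2018` = c(72L, 70L),
  check.names = FALSE, stringsAsFactors = FALSE
)
patients <- data.frame(
  Coutry_ID = c("A0001", "A0002"),
  Country_Name = c("USA", "Germany"),
  `1/28/2018` = c(7L, 9L),
  `2/28/2018` = c(13L, 10L),
  check.names = FALSE, stringsAsFactors = FALSE
)

id_cols <- c("Coutry_ID", "Country_Name")

# Hypothetical helper: stack the month columns of `df` into one value column.
to_long_base <- function(df, value_name) {
  month_cols <- setdiff(names(df), id_cols)
  long <- data.frame(
    # Repeat the block of id rows once per month column.
    df[rep(seq_len(nrow(df)), times = length(month_cols)), id_cols],
    # Each month label covers one full block of rows.
    Month = rep(month_cols, each = nrow(df)),
    # unlist() concatenates the month columns in the same block order.
    value = unlist(df[month_cols], use.names = FALSE),
    row.names = NULL, stringsAsFactors = FALSE
  )
  names(long)[names(long) == "value"] <- value_name
  long
}

# Merge the long data frames on country and month.
merged <- merge(to_long_base(sales, "Sales"),
                to_long_base(patients, "Patients"),
                by = c(id_cols, "Month"))
merged
```

Here `Month` stays a character column; wrapping it in `as.Date(Month, "%m/%d/%Y")` converts it to dates if chronological sorting is needed.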
