How to combine data frames in R based on similar timelines for multiple attributes, and then transform the data to make these columns row headers?

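I am trying to merge my sales data and patient data (and a few other attributes) in R. All of them are rolled up at country level for the same time frame. After merging, I want the result in long format instead of wide format, and it should stay unique at the country-month level.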

This is what my input data looks like -

1) Sales data

Coutry_ID   Country_Name    1/28/2018   2/28/2018   3/28/2018   4/28/2018   5/28/2018
A0001       USA               44           72         85          25          72
A0002       Germany           98           70         69          48          41
A0003       Russia            82           42         32          29          43
A0004       UK                79           83         51          48          47
A0005       France            45           75         10          13          23
A0006       India             92           85         28          13          18

2) Patients data

Coutry_ID   Country_Name    1/28/2018   2/28/2018   3/28/2018   4/28/2018   5/28/2018
A0001       USA                7          13          22          23          13
A0002       Germany            9          10          17          25          25
A0003       Russia            24          19           6           8           5
A0004       UK                 6           8          20           1          11
A0005       France             4           9           8          10          25
A0006       India             18          21           2          13          17

This is what I intend the output to look like -

Coutry_ID   Country_Name    Month       Sales   Patients
A0001       USA         1/28/2018       44      7
A0001       USA         2/28/2018       72      13
A0001       USA         3/28/2018       85      22
A0001       USA         4/28/2018       25      23
A0001       USA         5/28/2018       72      13
A0002       Germany     1/28/2018       98      9
A0002       Germany     2/28/2018       70      10
A0002       Germany     3/28/2018       69      17
A0002       Germany     4/28/2018       48      25
A0002       Germany     5/28/2018       41      25
A0003       Russia      1/28/2018       82      24
A0003       Russia      2/28/2018       42      19
A0003       Russia      3/28/2018       32      6
A0003       Russia      4/28/2018       29      8
A0003       Russia      5/28/2018       43      5
A0004       UK          1/28/2018       79      6
A0004       UK          2/28/2018       83      8
A0004       UK          3/28/2018       51      20
A0004       UK          4/28/2018       48      1
A0004       UK          5/28/2018       47      11
A0005       France      1/28/2018       45      4
A0005       France      2/28/2018       75      9
A0005       France      3/28/2018       10      8
A0005       France      4/28/2018       13      10
A0005       France      5/28/2018       23      25
A0006       India       1/28/2018       92      18
A0006       India       2/28/2018       85      21
A0006       India       3/28/2018       28      2
A0006       India       4/28/2018       13      13
A0006       India       5/28/2018       18      17

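I need some guidance on these two things -

1 - How do I convert the data from wide to long?

2 - To merge the data, I am thinking of using dplyr left_join on all these datasets along with my country master list, which holds the ID and name. My doubt is whether I should convert the datasets from wide to long format first, or do it after merging.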

You can get both data frames into long format first and then join them:

library(dplyr)
library(tidyr)

inner_join(
   sales %>% pivot_longer(cols = -c(Coutry_ID, Country_Name), values_to = 'Sales'),
   patients %>% pivot_longer(cols = -c(Coutry_ID, Country_Name), 
                values_to = 'Patients'), 
       by = c("Coutry_ID", "Country_Name", "name"))

# A tibble: 30 x 5
#   Coutry_ID Country_Name name      Sales Patients
#   <fct>     <fct>        <chr>     <int>    <int>
# 1 A0001     USA          1/28/2018    44        7
# 2 A0001     USA          2/28/2018    72       13
# 3 A0001     USA          3/28/2018    85       22
# 4 A0001     USA          4/28/2018    25       23
# 5 A0001     USA          5/28/2018    72       13
# 6 A0002     Germany      1/28/2018    98        9
# 7 A0002     Germany      2/28/2018    70       10
# 8 A0002     Germany      3/28/2018    69       17
# 9 A0002     Germany      4/28/2018    48       25
#10 A0002     Germany      5/28/2018    41       25
# … with 20 more rows
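
The output above keeps pivot_longer's default key column name, `name`, and leaves the dates as character strings. If you want the `Month` column from the question, one possible variation is sketched below. The helper `to_long` is a hypothetical name introduced here, the two-country data is a cut-down copy of the question's data, and parsing the headers with `as.Date` assumes they are always in month/day/year form.

```r
library(dplyr)
library(tidyr)

# Cut-down illustrative copy of the question's data (two countries, two months).
sales <- data.frame(
  Coutry_ID    = c("A0001", "A0002"),
  Country_Name = c("USA", "Germany"),
  `1/28/2018`  = c(44L, 98L),
  `2/28/2018`  = c(72L, 70L),
  check.names  = FALSE
)
patients <- data.frame(
  Coutry_ID    = c("A0001", "A0002"),
  Country_Name = c("USA", "Germany"),
  `1/28/2018`  = c(7L, 9L),
  `2/28/2018`  = c(13L, 10L),
  check.names  = FALSE
)

# Hypothetical helper: reshape one wide table to long format, naming the key
# column "Month" and the value column after the measure it holds.
to_long <- function(df, value_name) {
  df %>%
    pivot_longer(cols = -c(Coutry_ID, Country_Name),
                 names_to = "Month", values_to = value_name) %>%
    # Assumes every date header is written as month/day/year.
    mutate(Month = as.Date(Month, format = "%m/%d/%Y"))
}

# Join the two long tables on country and month.
result <- inner_join(
  to_long(sales, "Sales"),
  to_long(patients, "Patients"),
  by = c("Coutry_ID", "Country_Name", "Month")
)
print(result)
```

Storing `Month` as a Date rather than a string means later sorting and filtering by month behave chronologically.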

Data

sales <- structure(list(Coutry_ID = structure(1:6, .Label = c("A0001", 
"A0002", "A0003", "A0004", "A0005", "A0006"), class = "factor"), 
Country_Name = structure(c(6L, 2L, 4L, 5L, 1L, 3L), .Label = c("France", 
"Germany", "India", "Russia", "UK", "USA"), class = "factor"), 
`1/28/2018` = c(44L, 98L, 82L, 79L, 45L, 92L), `2/28/2018` = c(72L, 
70L, 42L, 83L, 75L, 85L), `3/28/2018` = c(85L, 69L, 32L, 
51L, 10L, 28L), `4/28/2018` = c(25L, 48L, 29L, 48L, 13L, 
13L), `5/28/2018` = c(72L, 41L, 43L, 47L, 23L, 18L)), class = 
"data.frame", row.names = c(NA, -6L))

patients <- structure(list(Coutry_ID = structure(1:6, .Label = c("A0001", 
"A0002", "A0003", "A0004", "A0005", "A0006"), class = "factor"), 
Country_Name = structure(c(6L, 2L, 4L, 5L, 1L, 3L), .Label = c("France", 
"Germany", "India", "Russia", "UK", "USA"), class = "factor"), 
`1/28/2018` = c(7L, 9L, 24L, 6L, 4L, 18L), `2/28/2018` = c(13L, 
10L, 19L, 8L, 9L, 21L), `3/28/2018` = c(22L, 17L, 6L, 20L, 
8L, 2L), `4/28/2018` = c(23L, 25L, 8L, 1L, 10L, 13L), `5/28/2018` = c(13L, 
25L, 5L, 11L, 25L, 17L)), class = "data.frame", row.names = c(NA, -6L))
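
On the question's second point (reshape before or after merging), the reverse order is also possible. The sketch below joins the wide tables first and reshapes once at the end. It assumes both tables carry exactly the same date columns and that no column name already contains an underscore; the suffixes `_Sales` and `_Patients` are choices made for this illustration, and the data is a cut-down copy of the question's.

```r
library(dplyr)
library(tidyr)

# Cut-down illustrative copy of the question's data (two countries, two months).
sales <- data.frame(
  Coutry_ID    = c("A0001", "A0002"),
  Country_Name = c("USA", "Germany"),
  `1/28/2018`  = c(44L, 98L),
  `2/28/2018`  = c(72L, 70L),
  check.names  = FALSE
)
patients <- data.frame(
  Coutry_ID    = c("A0001", "A0002"),
  Country_Name = c("USA", "Germany"),
  `1/28/2018`  = c(7L, 9L),
  `2/28/2018`  = c(13L, 10L),
  check.names  = FALSE
)

# Join the wide tables first; the clashing date columns get a suffix each,
# e.g. "1/28/2018_Sales" and "1/28/2018_Patients".
wide <- inner_join(sales, patients,
                   by = c("Coutry_ID", "Country_Name"),
                   suffix = c("_Sales", "_Patients"))

# Split each column name at "_": the first part becomes the Month value,
# the second part (".value") becomes the name of the output column.
long <- wide %>%
  pivot_longer(cols = -c(Coutry_ID, Country_Name),
               names_to = c("Month", ".value"),
               names_sep = "_")
print(long)
```

This ordering is more fragile: a date column present in only one table would not receive a suffix, and the name split would then fail or misbehave. Reshaping each table to long format before joining avoids that dependency.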

Base R (not as eloquent as the above):

# Create a named list of dataframes:
df_list <- list(patients = patients, sales = sales)

# Create a vector in each with the name of the dataframe:
df_list <- mapply(cbind,  df_list, "desc" = as.character(names(df_list)),
                  SIMPLIFY = FALSE)

# Define a function to reshape the data:
reshape_ps <- function(x){

tmp <- setNames(reshape(x,
        direction = "long",
        varying = which(names(x) %in% names(x[,sapply(x, is.numeric)])),
        idvar = c(!(names(x) %in% names(x[,sapply(x, is.numeric)]))),
        v.names = "month",
        times = as.Date(names(x[,sapply(x, is.numeric)]), "%m/%d/%Y"),
        new.row.names = 1:(nrow(x)*length(which(names(x) %in% names(x[,sapply(x, is.numeric)]))))),
        c(names(x[!(names(x) %in% names(x[,sapply(x, is.numeric)]))]), "month", as.character(unique(x$desc))))

# Drop the dataframe name vector:
clean <- tmp[,names(tmp) != "desc"]

# Specify the return object:
return(clean)
}

# Merge the result of the function applied on both dataframes:
Reduce(function(y, z){merge(y, z, by = intersect(colnames(y), colnames(z)), all = TRUE)},
                            Map(function(x){reshape_ps(x)}, df_list))
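
For comparison, the same reshape-then-merge idea can be written in base R without reshape(). This is only a sketch under stated assumptions: `to_long_base` is a hypothetical helper name, the ID columns are hard-coded, every other column is assumed to be a month/day/year date header, and the data is a cut-down copy of the question's.

```r
# Cut-down illustrative copy of the question's data (two countries, two months).
sales <- data.frame(
  Coutry_ID    = c("A0001", "A0002"),
  Country_Name = c("USA", "Germany"),
  `1/28/2018`  = c(44L, 98L),
  `2/28/2018`  = c(72L, 70L),
  check.names  = FALSE
)
patients <- data.frame(
  Coutry_ID    = c("A0001", "A0002"),
  Country_Name = c("USA", "Germany"),
  `1/28/2018`  = c(7L, 9L),
  `2/28/2018`  = c(13L, 10L),
  check.names  = FALSE
)

# Hypothetical helper: stack the date columns of one wide table into rows.
to_long_base <- function(df, value_name) {
  id_cols    <- c("Coutry_ID", "Country_Name")
  month_cols <- setdiff(names(df), id_cols)
  long <- data.frame(
    # Repeat the ID rows once per date column (rows 1..n, then 1..n again, ...).
    df[rep(seq_len(nrow(df)), times = length(month_cols)), id_cols],
    # Each date header is repeated for all n rows, matching the order above.
    Month = as.Date(rep(month_cols, each = nrow(df)), format = "%m/%d/%Y"),
    # unlist() walks the date columns one after another, in the same order.
    value = unlist(df[month_cols], use.names = FALSE),
    row.names = NULL
  )
  names(long)[names(long) == "value"] <- value_name
  long
}

# Merge the two long tables on country and month, then sort for readability.
merged <- merge(to_long_base(sales, "Sales"),
                to_long_base(patients, "Patients"),
                by = c("Coutry_ID", "Country_Name", "Month"))
merged <- merged[order(merged$Coutry_ID, merged$Month), ]
print(merged)
```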
