简体   繁体   中英

Merge/combine rows with same ID and Date in R

I have an excel database like below. The Excel database had option to enter only 3 drug details. Wherever there are more than 3 drugs, it has been entered into another row with PID and Date.

Is there a way I can merge the rows in R so that each patient's records will be in a single row? In the example below, I need to merge Row 1 & 2 and 4 & 6.

Thanks.

Row PID Date Drug1 Dose1 Drug2 Dose2 Drug3 Dose3 Age Place
1 11A 25/10/2021 RPG 12 NAT 34 QRT 5 45 PMk
2 11A 25/10/2021 BET 10 SET 43 BLT 45
3 12B 20/10/2021 ATY 13 LTP 3 CRT 3 56 GTL
4 13A 22/10/2021 GGS 7 GSF 12 ERE 45 45 RKS
5 13A 26/10/2021 BRT 9 ARR 4 GSF 34 46 GLO
6 13A 22/10/2021 DFS 5
7 14B 04/08/2021 GDS 2 TRE 55 HHS 34 25 MTK

Up front, the two methods below are completely different, not equivalents in "base R vs dplyr". I'm sure either can be translated to the other.

dplyr

The premise here is to first reshape/pivot the data longer so that each Drug/Dose is on its own line, renumber them appropriately, and then bring it back to a wide state.

NOTE : frankly, I usually prefer to deal with data in a long format, so consider keeping it in its state immediately before pivot_wider . This means you'd need to bring Age and Place back into it somehow.

Why? A long format deals very well with many types of aggregation; ggplot2 really really prefers data in the long format; I dislike seeing and having to deal with all of the NA /empty values that will invariably happen with this wide format, since many PIDs don't have (eg) Drug6 or later. This seems subjective, but it can really be an objective change/improvement to data-mangling, depending on your workflow.

library(dplyr)
# library(tidyr) # pivot_longer, pivot_wider
dat0 <- select(dat, PID, Date, Age, Place) %>%
  group_by(PID, Date) %>%
  summarize(across(everything(), ~ .[!is.na(.) & nzchar(trimws(.))][1] ))
dat %>%
  select(-Age, -Place) %>%
  tidyr::pivot_longer(
    -c(Row, PID, Date),
    names_to = c(".value", "iter"),
    names_pattern = "^([^0-9]+)([123]?)$") %>%
  arrange(Row, iter) %>%
  group_by(PID, Date) %>%
  mutate(iter = row_number()) %>%
  select(-Row) %>%
  tidyr::pivot_wider(
    c("PID", "Date"), names_sep = "",
    names_from = "iter", values_from = c("Drug", "Dose")) %>%
  left_join(dat0, by = c("PID", "Date"))
# # A tibble: 5 x 16
# # Groups:   PID, Date [5]
#   PID   Date       Drug1 Drug2 Drug3 Drug4 Drug5 Drug6 Dose1 Dose2 Dose3 Dose4 Dose5 Dose6   Age Place
#   <chr> <chr>      <chr> <chr> <chr> <chr> <chr> <chr> <int> <int> <int> <int> <int> <int> <int> <chr>
# 1 11A   25/10/2021 RPG   NAT   QRT   BET   "SET" "BLT"    12    34     5    10    43    45    45 PMk  
# 2 12B   20/10/2021 ATY   LTP   CRT   <NA>   <NA>  <NA>    13     3     3    NA    NA    NA    56 GTL  
# 3 13A   22/10/2021 GGS   GSF   ERE   DFS   ""    ""        7    12    45     5    NA    NA    45 RKS  
# 4 13A   26/10/2021 BRT   ARR   GSF   <NA>   <NA>  <NA>     9     4    34    NA    NA    NA    46 GLO  
# 5 14B   04/08/2021 GDS   TRE   HHS   <NA>   <NA>  <NA>     2    55    34    NA    NA    NA    25 MTK  

Notes:

  • I broke out dat0 early, since Age and Place don't really fit into the pivot/renumber/pivot mindset.

base R

Here's a base R method that splits (according to your grouping criteria: PID and Date ), finds the Drug/Dose columns that need to be renumbered, renames them, and the merge s all of the frames back together.

spl <- split(dat, ave(rep(1L, nrow(dat)), dat[,c("PID", "Date")], FUN = seq_along))
spl
# $`1`
#   Row PID       Date Drug1 Dose1 Drug2 Dose2 Drug3 Dose3 Age Place
# 1   1 11A 25/10/2021   RPG    12   NAT    34   QRT     5  45   PMk
# 3   3 12B 20/10/2021   ATY    13   LTP     3   CRT     3  56   GTL
# 4   4 13A 22/10/2021   GGS     7   GSF    12   ERE    45  45   RKS
# 5   5 13A 26/10/2021   BRT     9   ARR     4   GSF    34  46   GLO
# 7   7 14B 04/08/2021   GDS     2   TRE    55   HHS    34  25   MTK
# $`2`
#   Row PID       Date Drug1 Dose1 Drug2 Dose2 Drug3 Dose3 Age Place
# 2   2 11A 25/10/2021   BET    10   SET    43   BLT    45  NA      
# 6   6 13A 22/10/2021   DFS     5          NA          NA  NA      

nms <- lapply(spl, function(x) grep("^(Drug|Dose)", colnames(x), value = TRUE))
nms <- data.frame(i = rep(names(nms), lengths(nms)), oldnm = unlist(nms))
nms$grp <- gsub("[0-9]+$", "", nms$oldnm)
nms$newnm <- paste0(nms$grp, ave(nms$grp, nms$grp, FUN = seq_along))
nms <- split(nms, nms$i)

newspl <- Map(function(x, nm) {
  colnames(x)[ match(nm$oldnm, colnames(x)) ] <- nm$newnm
  x
}, spl, nms)
newspl[-1] <- lapply(newspl[-1], function(x) x[, c("PID", "Date", grep("^(Drug|Dose)", colnames(x), value = TRUE)), drop = FALSE ])
newspl
# $`1`
#   Row PID       Date Drug1 Dose1 Drug2 Dose2 Drug3 Dose3 Age Place
# 1   1 11A 25/10/2021   RPG    12   NAT    34   QRT     5  45   PMk
# 3   3 12B 20/10/2021   ATY    13   LTP     3   CRT     3  56   GTL
# 4   4 13A 22/10/2021   GGS     7   GSF    12   ERE    45  45   RKS
# 5   5 13A 26/10/2021   BRT     9   ARR     4   GSF    34  46   GLO
# 7   7 14B 04/08/2021   GDS     2   TRE    55   HHS    34  25   MTK
# $`2`
#   PID       Date Drug4 Dose4 Drug5 Dose5 Drug6 Dose6
# 2 11A 25/10/2021   BET    10   SET    43   BLT    45
# 6 13A 22/10/2021   DFS     5          NA          NA

Reduce(function(a, b) merge(a, b, by = c("PID", "Date"), all = TRUE), newspl)
#   PID       Date Row Drug1 Dose1 Drug2 Dose2 Drug3 Dose3 Age Place Drug4 Dose4 Drug5 Dose5 Drug6 Dose6
# 1 11A 25/10/2021   1   RPG    12   NAT    34   QRT     5  45   PMk   BET    10   SET    43   BLT    45
# 2 12B 20/10/2021   3   ATY    13   LTP     3   CRT     3  56   GTL  <NA>    NA  <NA>    NA  <NA>    NA
# 3 13A 22/10/2021   4   GGS     7   GSF    12   ERE    45  45   RKS   DFS     5          NA          NA
# 4 13A 26/10/2021   5   BRT     9   ARR     4   GSF    34  46   GLO  <NA>    NA  <NA>    NA  <NA>    NA
# 5 14B 04/08/2021   7   GDS     2   TRE    55   HHS    34  25   MTK  <NA>    NA  <NA>    NA  <NA>    NA

Notes:

  • The underlying premise of this is that you want to merge the rows onto previous rows. This means (to me) using base::merge or dplyr::full_join ; two good links for understanding these concepts, in case you are not aware: How to join (merge) data frames (inner, outer, left, right) , What's the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN and FULL JOIN?

  • To do that, we need to determine which rows are duplicates of previous; further, we need to know how many previous same-key rows there are. There are a few ways to do this, but I think the easiest is with base::split . In this case, no PID/Date combination has more than two rows, but if you had one combination that mandated a third row, spl would be length-3, and the resulting names would go out to Drug9 / Dose9 .

  • The second portion ( nms <- ... ) is where we work on the names. The first few steps create a nms dataframe that we'll use to map from old to new names. Since we're concerned about contiguous numbering through all multi-row groups, we aggregate on the base (number removed) of the Drug/Dose names, so that we number all Drug columns from Drug1 through how many there are.

    Note : this assumes that there are always perfect pairs of Drug# / Dose# ; if there is ever a mismatch, then the numbering will be suspect.

    We end with nms being a split dataframe, just like spl of the data. This is useful and important, since we'll Map (zip-like lapply ) them together.

  • The third block updates spl with the new names. The result in newspl is just renaming of the columns so that when we merge them together, no column-duplication will occur.

    One additional step here is removing unrelated columns from the 2nd and subsequent frame in the list. That is, we keep Age and Place in the first such frame but remove it from the rest. My assumption (based on the NA /empty nature of those fields in duplicate rows) is that we only want to keep the first row's values.

  • The last step is to iteratively merge them together. The Reduce function is nice for this.

Another tidyverse -based solution, with a pivot_longer followed by a pivot_wider :

library(tidyverse)

# Note that my dataframe does not contain column Row

df %>% 
  mutate(across(starts_with("Dose"), as.character)) %>% 
  pivot_longer(!c(PID, Date, Age, Place),names_to = "trm") %>% 
  group_by(PID, Date) %>% 
  fill(Age, Place) %>% 
  mutate(trm = paste(trm,1:n(),sep="_")) %>% 
  ungroup %>% 
  pivot_wider(c(PID, Date, Age, Place), names_from = trm) %>% 
  rename_with(~ paste0("Drug",1:length(.x)), starts_with("Drug")) %>% 
  rename_with(~ paste0("Dose",1:length(.x)), starts_with("Dose")) %>% 
  mutate(across(starts_with("Dose"), as.numeric))

#> # A tibble: 5 × 16
#>   PID   Date     Age Place Drug1 Dose1 Drug2 Dose2 Drug3 Dose3 Drug4 Dose4 Drug5
#>   <chr> <chr>  <int> <chr> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr>
#> 1 11A   25/10…    45 PMk   RPG      12 NAT      34 QRT       5 BET      10 SET  
#> 2 12B   20/10…    56 GTL   ATY      13 LTP       3 CRT       3 <NA>     NA <NA> 
#> 3 13A   22/10…    45 RKS   GGS       7 GSF      12 ERE      45 DFS       5 <NA> 
#> 4 13A   26/10…    46 GLO   BRT       9 ARR       4 GSF      34 <NA>     NA <NA> 
#> 5 14B   04/08…    25 MTK   GDS       2 TRE      55 HHS      34 <NA>     NA <NA> 
#> # … with 3 more variables: Dose5 <dbl>, Drug6 <chr>, Dose6 <dbl>

Update:

With the help of akrun see here: Use ~separate after mutate and across

We could:

library(dplyr)
library(stringr)
library(tidyr)
df %>% 
  group_by(PID) %>% 
  summarise(across(everything(), ~toString(.))) %>% 
  mutate(across(everything(), ~ list(tibble(col1 = .) %>% 
                             separate(col1, into = str_c(cur_column(), 1:3), sep = ",\\s+", fill = "left", extra = "drop")))) %>% 
  unnest(c(PID, Row, Date, Drug1, Dose1, Drug2, Dose2, Drug3, Dose3, Age, 
           Place)) %>% 
  distinct() %>% 
  select(-1, -2)
  PID3  Row1  Row2  Row3  Date1      Date2      Date3      Drug11 Drug12 Drug13 Dose11 Dose12 Dose13 Drug21 Drug22 Drug23 Dose21 Dose22 Dose23 Drug31 Drug32 Drug33 Dose31 Dose32 Dose33 Age1  Age2  Age3  Place1 Place2 Place3
  <chr> <chr> <chr> <chr> <chr>      <chr>      <chr>      <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr> <chr> <chr> <chr>  <chr>  <chr> 
1 11A   NA    1     2     NA         25/10/2021 25/10/2021 NA     RPG    BET    NA     12     10     NA     NAT    SET    NA     34     43     NA     QRT    BLT    NA     5      45     NA    45    NA    NA     PMk    NA    
2 12B   NA    NA    3     NA         NA         20/10/2021 NA     NA     ATY    NA     NA     13     NA     NA     LTP    NA     NA     3      NA     NA     CRT    NA     NA     3      NA    NA    56    NA     NA     GTL   
3 13A   4     5     6     22/10/2021 26/10/2021 22/10/2021 GGS    BRT    DFS    7      9      5      GSF    ARR    NA     12     4      NA     ERE    GSF    NA     45     34     NA     45    46    NA    RKS    GLO    NA    
4 14B   NA    NA    7     NA         NA         04/08/2021 NA     NA     GDS    NA     NA     2      NA     NA     TRE    NA     NA     55     NA     NA     HHS    NA     NA     34     NA    NA    25    NA     NA     MTK   

First answer: Keeping the excellent explanation of @r2evans in mind! We could do it this way if really desired.

library(dplyr)
df %>% 
  group_by(PID) %>% 
  summarise(across(everything(), ~toString(.)))

output:

  PID   Row     Date                               Drug1         Dose1   Drug2        Dose2     Drug3        Dose3      Age        Place       
  <chr> <chr>   <chr>                              <chr>         <chr>   <chr>        <chr>     <chr>        <chr>      <chr>      <chr>       
1 11A   1, 2    25/10/2021, 25/10/2021             RPG, BET      12, 10  NAT, SET     34, 43    QRT, BLT     5, 45      45, NA     PMk, NA     
2 12B   3       20/10/2021                         ATY           13      LTP          3         CRT          3          56         GTL         
3 13A   4, 5, 6 22/10/2021, 26/10/2021, 22/10/2021 GGS, BRT, DFS 7, 9, 5 GSF, ARR, NA 12, 4, NA ERE, GSF, NA 45, 34, NA 45, 46, NA RKS, GLO, NA
4 14B   7       04/08/2021                         GDS           2       TRE          55        HHS          34         25         MTK  

a data.table approach

library(data.table)
DT <- fread("Row    PID     Date    Drug1   Dose1   Drug2   Dose2   Drug3   Dose3   Age     Place
            1   11A     25/10/2021  RPG     12  NAT     34  QRT     5   45  PMk
            2   11A     25/10/2021  BET     10  SET     43  BLT     45      
            3   12B     20/10/2021  ATY     13  LTP     3   CRT     3   56  GTL
            4   13A     22/10/2021  GGS     7   GSF     12  ERE     45  45  RKS
            5   13A     26/10/2021  BRT     9   ARR     4   GSF     34  46  GLO
            6   13A     22/10/2021  DFS     5                       
            7   14B     04/08/2021  GDS     2   TRE     55  HHS     34  25  MTK")

dcast(DT)


DT
# Melt to long format
ans <- melt(DT, id.vars = c("PID", "Date"), 
     measure.vars = patterns(drug = "^Drug", dose = "^Dose"), 
     na.rm = TRUE)
# Paste and Collapse, use ; as separator
ans <- ans[, lapply(.SD, paste0, collapse = ";"), by = .(PID, Date)]
# Split string on ;
ans[, paste0("Drug", 1:length(tstrsplit(ans$drug, ";"))) := tstrsplit(drug, ";")]
ans[, paste0("Dose", 1:length(tstrsplit(ans$dose, ";"))) := tstrsplit(dose, ";")]
#join Age + Place data
ans[DT[!is.na(Age), ], `:=`(Age = i.Age, Place = i.Place), on = .(PID, Date)]
ans[, -c("variable", "drug", "dose")]
#    PID       Date Drug1 Drug2 Drug3 Drug4 Drug5 Drug6 Dose1 Dose2 Dose3 Dose4 Dose5 Dose6 Age Place
# 1: 11A 25/10/2021   RPG   BET   NAT   SET   QRT   BLT    12    10    34    43     5    45  45   PMk
# 2: 12B 20/10/2021   ATY   LTP   CRT  <NA>  <NA>  <NA>    13     3     3  <NA>  <NA>  <NA>  56   GTL
# 3: 13A 22/10/2021   GGS   DFS   GSF   ERE  <NA>  <NA>     7     5    12    45  <NA>  <NA>  45   RKS
# 4: 13A 26/10/2021   BRT   ARR   GSF  <NA>  <NA>  <NA>     9     4    34  <NA>  <NA>  <NA>  46   GLO
# 5: 14B 04/08/2021   GDS   TRE   HHS  <NA>  <NA>  <NA>     2    55    34  <NA>  <NA>  <NA>  25   MTK

Another answer to the festival.

Reading data from this page

require(rvest)
require(tidyverse)
d = read_html("https://stackoverflow.com/q/69787018/694915") %>%
  html_nodes("table") %>%
  html_table(fill = TRUE) 

List of dose per PID and DATE

# primera tabla
d[[1]]  -> df

df %>% 
  pivot_longer(
    cols = starts_with("Drug"),
    values_to = "Drug"
  ) %>%
  select( !name ) %>% 
  pivot_longer(
    cols = starts_with("Dose"),
    values_to = "Dose"
  ) %>%
  select( !name ) %>%
  drop_na() %>%  
  pivot_wider(
    names_from = Drug,
    values_from = Dose ,
    values_fill = list(0)
  ) -> dose

Variable dose contains this data

剂量表 ( https://i.stack.imgur.com/lc3iN.png )

Not that elegant as previous ones, but is an idea to see the whole treatment per PID.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM