
Using pivot_wider or a similar function in R with repeated-measurement data

I have a dataframe of patients in the format of one line per chest x-ray. My columns include a specific measurement on the chest x-ray, the date of the chest x-ray, and then several additional columns that are the same for a given patient (like final outcome).

For example:

+--------+------------+----------+------------+-------------+-----+-------+---------+
| pat_id | index_date | cxr_date | delta_date | cxr_measure | age | admit | outcome |
+--------+------------+----------+------------+-------------+-----+-------+---------+
|      1 | 1/2/2020   | 1/2/2020 |          0 |         0.1 |  55 |     1 |       0 |
|      1 | 1/2/2020   | 1/3/2020 |          1 |         0.3 |  55 |     1 |       0 |
|      1 | 1/2/2020   | 1/3/2020 |          1 |         0.5 |  55 |     1 |       0 |
|      2 | 2/1/2020   | 2/2/2020 |          1 |         0.2 |  59 |     0 |       0 |
|      2 | 2/1/2020   | 2/3/2020 |          2 |         0.9 |  59 |     0 |       0 |
|      3 | 1/6/2020   | 1/6/2020 |          0 |         0.7 |  66 |     1 |       1 |
+--------+------------+----------+------------+-------------+-----+-------+---------+

I want to reformat the table so it is one line per patient. I think the end table should look something like the one below, where each measurement becomes a column named cxr_measure_#, with # being the delta_date. In the real dataset I'll have many of these columns (# would range from -5 to +30). If there are two rows/values on the same delta_date, ideally I would take the mean.

+--------+------------+----------------+---------------+---------------+---------------+-----+-------+---------+
| pat_id | index_date | first_cxr_date | cxr_measure_0 | cxr_measure_1 | cxr_measure_2 | age | admit | outcome |
+--------+------------+----------------+---------------+---------------+---------------+-----+-------+---------+
|      1 | 1/2/2020   | 1/2/2020       | 0.1           | 0.4           | NA            |  55 |     1 |       0 |
|      2 | 2/1/2020   | 2/2/2020       | NA            | 0.2           | 0.9           |  59 |     0 |       0 |
|      3 | 1/6/2020   | 1/6/2020       | 0.7           | NA            | NA            |  66 |     1 |       1 |
+--------+------------+----------------+---------------+---------------+---------------+-----+-------+---------+

Is there an easy way to reshape between these two tables? I've played a little with pivot_longer and pivot_wider, but wasn't sure how to (1) get the delta_date into the variable name and (2) take the mean when two rows share a date. I'm also curious whether this is more easily accomplished in Python (I did most of the data curation using pandas, then some additional cleaning and analysis in R).

To extend @Dave2e's response, you can use group_by and then min to get first_cxr_date per pat_id, which lets you compose a neat functional solution in a single pipeline.

library(tibble)
library(dplyr)
library(tidyr)

df <- tribble(
  ~pat_id, ~index_date, ~cxr_date,  ~delta_date, ~cxr_measure, ~age, ~admit, ~outcome,
        1, '1/2/2020',  '1/2/2020',           0,          0.1,   55,      1,        0,
        1, '1/2/2020',  '1/3/2020',           1,          0.3,   55,      1,        0,
        1, '1/2/2020',  '1/3/2020',           1,          0.5,   55,      1,        0,
        2, '2/1/2020',  '2/2/2020',           1,          0.2,   59,      0,        0,
        2, '2/1/2020',  '2/3/2020',           2,          0.9,   59,      0,        0,
        3, '1/6/2020',  '1/6/2020',           0,          0.7,   66,      1,        1
)

df %>%
  # first_cxr_date = earliest cxr_date per patient. cxr_date is character here,
  # so min() compares strings; that works for this sample, but convert with
  # as.Date(cxr_date, "%m/%d/%Y") first if the real dates are stored as text.
  group_by(pat_id) %>%
  mutate(first_cxr_date = min(cxr_date)) %>%
  ungroup() %>%
  pivot_wider(
    id_cols = -cxr_date,            # all remaining columns identify a patient;
                                    # names_from/values_from are never used as ids
    names_from = delta_date,        # column names come from delta_date
    values_from = cxr_measure,
    names_prefix = 'cxr_measure_',  # prepended to each new column name
    values_fn = mean                # average duplicates on the same delta_date
  )
# A tibble: 3 x 9
  pat_id index_date   age admit outcome first_cxr_date cxr_measure_0 cxr_measure_1 cxr_measure_2
   <dbl> <chr>      <dbl> <dbl>   <dbl> <chr>                  <dbl>         <dbl>         <dbl>
1      1 1/2/2020      55     1       0 1/2/2020                 0.1           0.4          NA  
2      2 2/1/2020      59     0       0 2/2/2020                NA             0.2           0.9
3      3 1/6/2020      66     1       1 1/6/2020                 0.7          NA            NA  
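
For the real data, where delta_date runs from -5 to +30, two details may be worth handling: the earliest date should be taken on parsed dates rather than on text, and the new columns should come out in numeric order rather than in order of first appearance. Below is a minimal sketch of one way to do both, assuming tidyr 1.1.0 or later (for names_sort and a bare function in values_fn) and dates in month/day/year format. The toy data frame, including its negative offset, is made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one patient, rows deliberately out of date order,
# with one film taken two days before the index date (delta_date = -2)
toy <- tibble(
  pat_id      = c(1, 1, 1, 1),
  cxr_date    = c("1/10/2020", "1/2/2020", "1/2/2020", "12/31/2019"),
  delta_date  = c(8, 0, 0, -2),
  cxr_measure = c(0.6, 0.1, 0.3, 0.4),
  age         = 55
)

wide <- toy %>%
  mutate(cxr_date = as.Date(cxr_date, format = "%m/%d/%Y")) %>%  # parse text to Date
  group_by(pat_id) %>%
  mutate(first_cxr_date = min(cxr_date)) %>%  # chronological minimum per patient
  ungroup() %>%
  select(-cxr_date) %>%            # drop the per-film column before widening
  pivot_wider(
    names_from   = delta_date,
    values_from  = cxr_measure,
    names_prefix = "cxr_measure_",
    names_sort   = TRUE,           # order new columns by delta_date value
    values_fn    = mean            # average films sharing a delta_date
  )

names(wide)
```

Negative offsets produce non-syntactic column names such as `cxr_measure_-2`, which have to be wrapped in backticks when referenced later.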

Here is a hybrid approach: pivot_wider calculates the means of cxr_measure, and dplyr's summarize function determines the first cxr_date.

df<- structure(list(pat_id = c(1L, 1L, 1L, 2L, 2L, 3L), 
                    index_date = c("1/2/2020",  "1/2/2020", "1/2/2020", "2/1/2020", "2/1/2020", "1/6/2020"), 
                    cxr_date = c("1/2/2020", "1/3/2020", "1/3/2020", "2/2/2020",  "2/3/2020", "1/6/2020"), 
                    delta_date = c(0L, 1L, 1L, 1L, 2L, 0L), 
                    cxr_measure = c(0.1, 0.3, 0.5, 0.2, 0.9, 0.7), 
                    age = c(55L,55L, 55L, 59L, 59L, 66L), 
                    admit = c(1L, 1L, 1L, 0L, 0L, 1L), 
                    outcome = c(0L, 0L, 0L, 0L, 0L, 1L)), class = "data.frame", row.names = c(NA, -6L))

library(tidyr)
library(dplyr)

# Spread cxr_measure across delta_date, averaging duplicates
answer <- pivot_wider(df,
                      id_cols = -cxr_date,  # all remaining columns identify a patient
                      names_from = delta_date,
                      values_from = cxr_measure,
                      values_fn = list(cxr_measure = mean),
                      names_glue = 'cxr_measure_{delta_date}')

# Earliest chest x-ray per patient, parsed as a real date
firstdate <- df %>%
  group_by(pat_id) %>%
  summarize(first_cxr_date = min(as.Date(cxr_date, "%m/%d/%Y")))

answer <- left_join(answer, firstdate)  # joins on the shared pat_id column
Joining, by = "pat_id"
# A tibble: 3 x 9
  pat_id index_date   age admit outcome cxr_measure_0 cxr_measure_1 cxr_measure_2 first_cxr_date
   <int> <chr>      <int> <int>   <int>         <dbl>         <dbl>         <dbl> <date>
1      1 1/2/2020      55     1       0           0.1           0.4          NA   2020-01-02
2      2 2/1/2020      59     0       0          NA             0.2           0.9 2020-02-02
3      3 1/6/2020      66     1       1           0.7          NA            NA   2020-01-06

I'm sure there is a way to combine all of this into one function call, but sometimes ugly is just faster.

Special thanks to @Onyambu, who taught me a valuable point today.

You can also use the following solution. Note the .value placeholder, which is particularly useful with pivot_longer when several column names have to be created from the data. Here it tells pivot_wider that part of each new name is the name of the column the values are taken from.

library(dplyr)
library(tidyr)

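df %>%
  group_by(pat_id) %>%
  mutate(id = row_number()) %>%  # row id keeps duplicate delta_date values on separate rows
  pivot_wider(names_from = delta_date, values_from = cxr_measure,
              names_glue = "{.value}_{delta_date}") %>%
  # per-patient mean of every new column; a column with no values gives NaN
  mutate(across(starts_with("cxr_measure_"), ~ mean(.x, na.rm = TRUE))) %>%
  select(-id) %>%
  slice_head(n = 1)              # keep one row per patient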


# A tibble: 3 x 9
# Groups:   pat_id [3]
  pat_id index_date cxr_date   age admit outcome cxr_measure_0 cxr_measure_1 cxr_measure_2
   <int> <chr>      <chr>    <int> <int>   <int>         <dbl>         <dbl>         <dbl>
1      1 1/2/2020   1/2/2020    55     1       0           0.1           0.4         NaN  
2      2 2/1/2020   2/2/2020    59     0       0         NaN             0.2           0.9
3      3 1/6/2020   1/6/2020    66     1       1           0.7         NaN           NaN 
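
To make the role of .value more concrete, here is a minimal sketch with two value columns widened at once. The toy data and the second measurement, cxr_score, are hypothetical and only serve to show how the glue template expands.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: cxr_score is an invented second measurement per film
toy <- tibble(
  pat_id      = c(1, 1, 2),
  delta_date  = c(0, 1, 1),
  cxr_measure = c(0.1, 0.4, 0.2),
  cxr_score   = c(3, 5, 2)
)

wide <- toy %>%
  pivot_wider(
    names_from  = delta_date,
    values_from = c(cxr_measure, cxr_score),
    # {.value} is replaced by each values_from column name in turn
    names_glue  = "{.value}_{delta_date}"
  )

names(wide)
```

Each values_from column gets its own set of delta_date columns, so the same template would scale to several measurements per film without listing the output names by hand.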

