I have a dataframe of patients in the format of one line per chest x-ray. My columns include a specific measurement on the chest x-ray, the date of the chest x-ray, and then several additional columns that are the same for a given patient (like final outcome).
For example:
+--------+------------+----------+------------+-------------+-----+-------+---------+
| pat_id | index_date | cxr_date | delta_date | cxr_measure | age | admit | outcome |
+--------+------------+----------+------------+-------------+-----+-------+---------+
| 1 | 1/2/2020 | 1/2/2020 | 0 | 0.1 | 55 | 1 | 0 |
| 1 | 1/2/2020 | 1/3/2020 | 1 | 0.3 | 55 | 1 | 0 |
| 1 | 1/2/2020 | 1/3/2020 | 1 | 0.5 | 55 | 1 | 0 |
| 2 | 2/1/2020 | 2/2/2020 | 1 | 0.2 | 59 | 0 | 0 |
| 2 | 2/1/2020 | 2/3/2020 | 2 | 0.9 | 59 | 0 | 0 |
| 3 | 1/6/2020 | 1/6/2020 | 0 | 0.7 | 66 | 1 | 1 |
+--------+------------+----------+------------+-------------+-----+-------+---------+
I want to reformat the table so it is one line per patient. My end table I think should look something like the below, where each variable is turned into cxr_measure_#, where # is the delta_date. In the real dataset, I'll have many of these columns (the # would range from -5 to +30). If there are two rows/values on the same delta_date, ideally I would want to take the mean.
+--------+------------+----------------+---------------+---------------+--------------+-----+-------+---------+
| pat_id | index_date | first_cxr_date | cxr_measure_0 | cxr_measure_1 | cxr_measure_2 | age | admit | outcome |
+--------+------------+----------------+---------------+---------------+--------------+-----+-------+---------+
| 1 | 1/2/2020 | 1/2/2020 | 0.1 | 0.4 | NA | 55 | 1 | 0 |
| 2 | 2/1/2020 | 2/2/2020 | NA | 0.2 | 0.9 | 59 | 0 | 0 |
| 3 | 1/6/2020 | 1/6/2020 | 0.7 | NA | NA | 66 | 1 | 1 |
+--------+------------+----------------+---------------+---------------+--------------+-----+-------+---------+
Is there an easy way to reshape between these two tables? I've played a little bit with pivot_longer and pivot_wider, but wasn't sure how to (1) get the delta_date into the variable name and (2) take the mean if there are two overlapping dates. I'm also curious whether this is more easily accomplished in Python (I did most of the data curation using pandas, but then did some additional data cleaning and analysis in R).
To extend @Dave2e's response, you can use group_by then min to get first_cxr_date by pat_id; this lets you compose a neat functional solution.
library(tibble)
library(dplyr)
library(tidyr)
df <-
tribble(
~pat_id, ~index_date, ~cxr_date, ~delta_date, ~cxr_measure, ~age, ~admit, ~outcome,
1, '1/2/2020', '1/2/2020', 0, 0.1, 55, 1, 0,
1, '1/2/2020', '1/3/2020', 1, 0.3, 55, 1, 0,
1, '1/2/2020', '1/3/2020', 1, 0.5, 55, 1, 0,
2, '2/1/2020', '2/2/2020', 1, 0.2, 59, 0, 0,
2, '2/1/2020', '2/3/2020', 2, 0.9, 59, 0, 0,
3, '1/6/2020', '1/6/2020', 0, 0.7, 66, 1, 1)
df %>%
group_by(pat_id) %>%
mutate(first_cxr_date = min(cxr_date)) %>% # earliest cxr_date per pat_id; cxr_date is character here, so min() compares strings (convert with as.Date() on real data)
ungroup() %>%
pivot_wider(id_cols = -c(delta_date, cxr_measure, cxr_date)
, names_from = delta_date # column names from delta_date
, values_from = cxr_measure
, names_prefix = 'cxr_measure_' # paste string to column names
, values_fn = mean # combine with mean
)
# A tibble: 3 x 9
pat_id index_date age admit outcome first_cxr_date cxr_measure_0 cxr_measure_1 cxr_measure_2
<dbl> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 1 1/2/2020 55 1 0 1/2/2020 0.1 0.4 NA
2 2 2/1/2020 59 0 0 2/2/2020 NA 0.2 0.9
3 3 1/6/2020 66 1 1 1/6/2020 0.7 NA NA
Here is a hybrid approach that uses pivot_wider to calculate the means of the cxr_measure values and dplyr's summarize function to determine the first cxr_date.
df<- structure(list(pat_id = c(1L, 1L, 1L, 2L, 2L, 3L),
index_date = c("1/2/2020", "1/2/2020", "1/2/2020", "2/1/2020", "2/1/2020", "1/6/2020"),
cxr_date = c("1/2/2020", "1/3/2020", "1/3/2020", "2/2/2020", "2/3/2020", "1/6/2020"),
delta_date = c(0L, 1L, 1L, 1L, 2L, 0L),
cxr_measure = c(0.1, 0.3, 0.5, 0.2, 0.9, 0.7),
age = c(55L,55L, 55L, 59L, 59L, 66L),
admit = c(1L, 1L, 1L, 0L, 0L, 1L),
outcome = c(0L, 0L, 0L, 0L, 0L, 1L)), class = "data.frame", row.names = c(NA, -6L))
library(tidyr)
library(dplyr)
answer <-pivot_wider(df, id_cols = -c("delta_date", "cxr_measure", "cxr_date"),
names_from = "delta_date",
values_from = c("cxr_measure"),
values_fn = list(cxr_measure = mean),
names_glue ='cxr_measure_{delta_date}')
firstdate <-df %>% group_by(pat_id) %>% summarize(first_cxr_date=min(as.Date(cxr_date, "%m/%d/%Y")))
answer <- left_join(answer, firstdate)
Joining, by = "pat_id"
# A tibble: 3 x 9
pat_id index_date age admit outcome cxr_measure_0 cxr_measure_1 cxr_measure_2 first_cxr_date
<int> <chr> <int> <int> <int> <dbl> <dbl> <dbl> <date>
1 1 1/2/2020 55 1 0 0.1 0.4 NA 2020-01-02
2 2 2/1/2020 59 0 0 NA 0.2 0.9 2020-02-02
3 3 1/6/2020 66 1 1 0.7 NA NA 2020-01-06
I am sure there is a way to combine all of this into one function call, but sometimes ugly is just faster.
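For what it's worth, the pivot and the first-date lookup can probably be folded into a single pipeline. The following is a minimal sketch rather than a definitive version: it rebuilds the sample data, assumes the dates are month/day/year strings, and uses `wide` as an arbitrary result name. It converts cxr_date to a Date before taking the minimum, and it drops cxr_date with select() instead of passing id_cols.

```r
library(dplyr)
library(tidyr)

# Sample data from the question (dates assumed to be month/day/year strings)
df <- data.frame(
  pat_id      = c(1L, 1L, 1L, 2L, 2L, 3L),
  index_date  = c("1/2/2020", "1/2/2020", "1/2/2020", "2/1/2020", "2/1/2020", "1/6/2020"),
  cxr_date    = c("1/2/2020", "1/3/2020", "1/3/2020", "2/2/2020", "2/3/2020", "1/6/2020"),
  delta_date  = c(0L, 1L, 1L, 1L, 2L, 0L),
  cxr_measure = c(0.1, 0.3, 0.5, 0.2, 0.9, 0.7),
  age         = c(55L, 55L, 55L, 59L, 59L, 66L),
  admit       = c(1L, 1L, 1L, 0L, 0L, 1L),
  outcome     = c(0L, 0L, 0L, 0L, 0L, 1L)
)

wide <- df %>%
  # parse the dates so that min() compares dates, not strings
  mutate(cxr_date = as.Date(cxr_date, "%m/%d/%Y")) %>%
  group_by(pat_id) %>%
  mutate(first_cxr_date = min(cxr_date)) %>%
  ungroup() %>%
  # cxr_date varies within a patient, so drop it before pivoting
  select(-cxr_date) %>%
  pivot_wider(
    names_from   = delta_date,
    values_from  = cxr_measure,
    names_prefix = "cxr_measure_",
    values_fn    = mean  # average rows that share a pat_id and delta_date
  )

print(wide)
```

Converting with as.Date() first matters on real data, because min() on character dates compares them as strings rather than chronologically.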
Special Thanks to dear Mr. @Onyambu who taught me a valuable point today.
You can also use the following solution. Just note the .value, which is quite useful in particular with pivot_longer when there are multiple column names to create from the data. Here it tells pivot_wider that part of the name is actually the name of the column we take values from.
library(dplyr)
library(tidyr)
df %>%
group_by(pat_id) %>%
mutate(id = row_number()) %>%
pivot_wider(names_from = delta_date, values_from = cxr_measure,
names_glue = "{.value}_{delta_date}") %>%
mutate(across(cxr_measure_0:cxr_measure_2, ~ mean(.x, na.rm = TRUE))) %>%
select(-id) %>%
slice_head(n = 1)
# A tibble: 3 x 9
# Groups: pat_id [3]
pat_id index_date cxr_date age admit outcome cxr_measure_0 cxr_measure_1 cxr_measure_2
<int> <chr> <chr> <int> <int> <int> <dbl> <dbl> <dbl>
1 1 1/2/2020 1/2/2020 55 1 0 0.1 0.4 NaN
2 2 2/1/2020 2/2/2020 59 0 0 NaN 0.2 0.9
3 3 1/6/2020 1/6/2020 66 1 1 0.7 NaN NaN
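With delta_date running from -5 to +30 in the real data, the hard-coded range cxr_measure_0:cxr_measure_2 would need to change. A possible variation, offered as a sketch under assumptions rather than a tested recipe for the full dataset: select the new columns with starts_with() and collapse each patient with summarise(), returning NA instead of NaN when a patient has no value for a given day. The result name `wide` and the choice to keep the first value of each per-patient column are assumptions made for illustration.

```r
library(dplyr)
library(tidyr)

# Sample data from the question
df <- data.frame(
  pat_id      = c(1L, 1L, 1L, 2L, 2L, 3L),
  index_date  = c("1/2/2020", "1/2/2020", "1/2/2020", "2/1/2020", "2/1/2020", "1/6/2020"),
  cxr_date    = c("1/2/2020", "1/3/2020", "1/3/2020", "2/2/2020", "2/3/2020", "1/6/2020"),
  delta_date  = c(0L, 1L, 1L, 1L, 2L, 0L),
  cxr_measure = c(0.1, 0.3, 0.5, 0.2, 0.9, 0.7),
  age         = c(55L, 55L, 55L, 59L, 59L, 66L),
  admit       = c(1L, 1L, 1L, 0L, 0L, 1L),
  outcome     = c(0L, 0L, 0L, 0L, 0L, 1L)
)

wide <- df %>%
  group_by(pat_id) %>%
  mutate(id = row_number()) %>%   # unique row id so pivot_wider keeps every row
  ungroup() %>%
  pivot_wider(names_from = delta_date, values_from = cxr_measure,
              names_glue = "{.value}_{delta_date}") %>%
  group_by(pat_id) %>%
  summarise(
    # per-patient columns: keep the first value (cxr_date becomes the first listed date)
    across(c(index_date, cxr_date, age, admit, outcome), first),
    # measurement columns: mean of the non-missing values, NA if there are none
    across(starts_with("cxr_measure_"),
           ~ if (all(is.na(.x))) NA_real_ else mean(.x, na.rm = TRUE))
  )

print(wide)
```

Note that negative delta_date values give column names such as cxr_measure_-5, which are not syntactic R names and need backticks when referred to directly; starts_with() avoids having to type them.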