DISCLAIMER: I'm not sure "collapse" is quite the right term for this operation. If there's a more appropriate term, I'm all ears.
I have data on symptom severity for several hundred patients from multiple observations over time. Severity is defined on an ordinal scale. Here's a simplified example:
# Create example dataset
example.dat <- data.frame(
ID = c(1,1,1,2,2,2,3,3,3,4,4,4), # patient ID numbers
Time = c("T1", "T2", "T3", "T1", "T2", "T3", # times at which data were collected
"T1", "T2", "T3", "T1", "T2", "T3"),
Severity = c("Mild", "Moderate", "Mild", # severity of symptoms
"Severe", "Severe", "Moderate",
"None", NA, "None",
"Moderate", "Moderate", "Mild")
)
# Specify the order of the factor levels
example.dat$Severity <- ordered(example.dat$Severity,
levels = c("None",
"Mild",
"Moderate",
"Severe")
)
example.dat
The resulting data frame looks like this:
ID Time Severity
1 1 T1 Mild
2 1 T2 Moderate
3 1 T3 Mild
4 2 T1 Severe
5 2 T2 Severe
6 2 T3 Moderate
7 3 T1 None
8 3 T2 <NA>
9 3 T3 None
10 4 T1 Moderate
11 4 T2 Moderate
12 4 T3 Mild
I would like to create a new column containing the most severe symptoms (i.e., the highest level of the ordered factor) observed for each ID, which would look like this:
ID Time Severity Worst
1 1 T1 Mild Moderate
2 1 T2 Moderate Moderate
3 1 T3 Mild Moderate
4 2 T1 Severe Severe
5 2 T2 Severe Severe
6 2 T3 Moderate Severe
7 3 T1 None None
8 3 T2 <NA> None
9 3 T3 None None
10 4 T1 Moderate Moderate
11 4 T2 Moderate Moderate
12 4 T3 Mild Moderate
From there, I can easily subset to create this data frame, which includes, for each ID, the time of the most recent observation and the worst symptoms reported during the study period:
ID Time Worst
3 1 T3 Moderate
6 2 T3 Severe
9 3 T3 None
12 4 T3 Moderate
Any thoughts?
You can find the most severe symptom for each ID using ave:
example.dat$Worst <- ave(example.dat$Severity, example.dat$ID,
FUN = function(i) max(i, na.rm=TRUE))
The na.rm option is needed because some IDs have missing values.
You can then subset to only keep the most recent time.
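As a rough sketch of that last step in base R: the code below rebuilds the example data, adds Worst with ave as above, and keeps the last row per ID. It assumes the rows are already sorted by Time within each ID; the name latest is just illustrative.

```r
# Rebuild the example data so the sketch runs on its own
example.dat <- data.frame(
  ID = rep(1:4, each = 3),
  Time = rep(c("T1", "T2", "T3"), times = 4),
  Severity = c("Mild", "Moderate", "Mild",
               "Severe", "Severe", "Moderate",
               "None", NA, "None",
               "Moderate", "Moderate", "Mild")
)
example.dat$Severity <- ordered(example.dat$Severity,
                                levels = c("None", "Mild", "Moderate", "Severe"))

# Worst severity per patient, ignoring missing observations
example.dat$Worst <- ave(example.dat$Severity, example.dat$ID,
                         FUN = function(i) max(i, na.rm = TRUE))

# Keep the last row for each ID
# (assumption: rows are sorted by Time within each ID)
latest <- example.dat[!duplicated(example.dat$ID, fromLast = TRUE),
                      c("ID", "Time", "Worst")]
latest
```

If the rows might not be sorted, ordering by ID and Time first (for example with order()) would be a safer starting point.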
Here's a solution using the aggregate function in R:
example.dat <- data.frame(
ID = c(1,1,1,2,2,2,3,3,3,4,4,4), # patient ID numbers
Time = c("T1", "T2", "T3", "T1", "T2", "T3", # times at which data were collected
"T1", "T2", "T3", "T1", "T2", "T3"),
Severity = c("Mild", "Moderate", "Mild", # severity of symptoms
"Severe", "Severe", "Moderate",
"None", NA, "None",
"Moderate", "Moderate", "Mild")
)
# Specify the order of the factor levels
example.dat$Severity <- ordered(example.dat$Severity,
levels = c("None",
"Mild",
"Moderate",
"Severe")
)
new <- aggregate(Severity ~ ID , data = example.dat, FUN = max)
names(new)[names(new) == "Severity"] <- "Worst"
(final <- merge(example.dat, new))
Using dplyr:
library(dplyr)
res <- example.dat %>%
group_by(ID) %>%
mutate(Worst=Severity[which.max(Severity)])
res
#Source: local data frame [12 x 4]
# Groups: ID
# ID Time Severity Worst
# 1 1 T1 Mild Moderate
# 2 1 T2 Moderate Moderate
# 3 1 T3 Mild Moderate
# 4 2 T1 Severe Severe
# 5 2 T2 Severe Severe
# 6 2 T3 Moderate Severe
# 7 3 T1 None None
# 8 3 T2 NA None
# 9 3 T3 None None
# 10 4 T1 Moderate Moderate
# 11 4 T2 Moderate Moderate
# 12 4 T3 Mild Moderate
filter(res, Time=="T3") %>% select(-Severity)
#Source: local data frame [4 x 3]
#Groups: ID
# ID Time Worst
# 1 1 T3 Moderate
# 2 2 T3 Severe
# 3 3 T3 None
# 4 4 T3 Moderate
Or data.table:
library(data.table) ## 1.9.3
setDT(example.dat)[,Worst := Severity[which.max(Severity)], by=ID]
example.dat
You can get the latest version, 1.9.3, from here. If you'd rather use the CRAN version, 1.9.2, there's a small bug with factors that needs a workaround (it has been fixed in 1.9.3):
library(data.table) ## 1.9.2 from CRAN
setDT(example.dat)[, Worst := as.character(Severity)]
example.dat[, Worst := Worst[which.max(Severity)], by=ID]
Assuming that the data set is already ordered by ID, Time, this'll get you the final solution directly:
require(data.table) ## 1.9.3
setDT(example.dat)[, list(Time=Time[.N], Worst=Severity[which.max(Severity)]), by=ID]
# ID Time Worst
# 1: 1 T3 Moderate
# 2: 2 T3 Severe
# 3: 3 T3 None
# 4: 4 T3 Moderate
setDT converts the data.frame to a data.table. Then, we group by ID and get the last value of Time in that group using .N, which is an integer vector of length 1 holding the number of observations in that group. Similarly, Severity[which.max(Severity)] picks out the maximum Severity in that group.
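For anyone who wants to see the same per-group logic without data.table, here is a base R sketch (a swapped-in technique, not what the answer above uses). It splits the data by ID, takes the last Time in each piece, where nrow(d) plays the role of .N, and the highest Severity. The names per.id and final are only illustrative, and it makes the same assumption that the data is ordered by ID and Time.

```r
# Rebuild the example data so the sketch runs on its own
example.dat <- data.frame(
  ID = rep(1:4, each = 3),
  Time = rep(c("T1", "T2", "T3"), times = 4),
  Severity = c("Mild", "Moderate", "Mild",
               "Severe", "Severe", "Moderate",
               "None", NA, "None",
               "Moderate", "Moderate", "Mild")
)
example.dat$Severity <- ordered(example.dat$Severity,
                                levels = c("None", "Mild", "Moderate", "Severe"))

# One small data frame per ID: last Time and worst Severity
per.id <- lapply(split(example.dat, example.dat$ID), function(d) {
  data.frame(ID = d$ID[1],
             Time = d$Time[nrow(d)],                  # nrow(d) acts like .N
             Worst = max(d$Severity, na.rm = TRUE))   # highest ordered level
})

# Stack the per-ID pieces into a single data frame
final <- do.call(rbind, per.id)
rownames(final) <- NULL
final
```

On a few hundred patients this should be fast enough; the data.table version is the better fit if the data gets large.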