
R: How to collapse multiple observations of an ordered factor within a data frame?

DISCLAIMER: I'm not sure "collapse" is quite the right term for this operation. If there's a more appropriate term, I'm all ears.

I have data on symptom severity for several hundred patients from multiple observations over time. Severity is defined on an ordinal scale. Here's a simplified example:

# Create example dataset
example.dat <- data.frame(
  ID = c(1,1,1,2,2,2,3,3,3,4,4,4),  # patient ID numbers
  Time = c("T1", "T2", "T3", "T1", "T2", "T3",  # times at which data were collected
           "T1", "T2", "T3", "T1", "T2", "T3"),
  Severity = c("Mild", "Moderate", "Mild",  # severity of symptoms
          "Severe", "Severe", "Moderate",
          "None", NA, "None",
          "Moderate", "Moderate", "Mild")
)

# Specify the order of the factor levels
example.dat$Severity <- ordered(example.dat$Severity,
                                levels = c("None",
                                           "Mild",
                                           "Moderate",
                                           "Severe")
                                )

example.dat

The resulting data frame looks like this:

   ID Time Severity
1   1   T1     Mild
2   1   T2 Moderate
3   1   T3     Mild
4   2   T1   Severe
5   2   T2   Severe
6   2   T3 Moderate
7   3   T1     None
8   3   T2     <NA>
9   3   T3     None
10  4   T1 Moderate
11  4   T2 Moderate
12  4   T3     Mild

I would like to create a new column containing the most severe symptoms (i.e., the highest level of the ordered factor) observed for each ID, which would look like this:

   ID Time Severity    Worst
1   1   T1     Mild Moderate
2   1   T2 Moderate Moderate
3   1   T3     Mild Moderate
4   2   T1   Severe   Severe
5   2   T2   Severe   Severe
6   2   T3 Moderate   Severe
7   3   T1     None     None
8   3   T2     <NA>     None
9   3   T3     None     None
10  4   T1 Moderate Moderate
11  4   T2 Moderate Moderate
12  4   T3     Mild Moderate

From there, I can easily subset to create this data frame, which includes, for each ID, the time of the most recent observation and the worst symptoms reported during the study period:

   ID Time    Worst
3   1   T3 Moderate
6   2   T3   Severe
9   3   T3     None
12  4   T3 Moderate

Any thoughts?

You can find the maximum (most severe) symptom level for each ID using ave:

example.dat$Worst <- ave(example.dat$Severity, example.dat$ID,
                         FUN = function(i) max(i, na.rm = TRUE))

The na.rm = TRUE argument is needed because some IDs have missing Severity values; without it, max returns NA for those IDs.
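As a quick illustration of why na.rm matters, here is a minimal sketch using only the three Severity values recorded for ID 3. The variable name sev is just for this example and is not part of the original data.

```r
# Severity values for ID 3, with the same level order as in the question
sev <- ordered(c("None", NA, "None"),
               levels = c("None", "Mild", "Moderate", "Severe"))

max(sev)                # NA, because one observation is missing
max(sev, na.rm = TRUE)  # None, the missing observation is ignored
```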

You can then subset to only keep the most recent time.
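One possible way to do that subsetting in base R is sketched below. It rebuilds the example data so that it runs on its own, and it assumes the rows are already sorted by ID and then Time, so the last row of each ID is the most recent observation. The name latest is only illustrative.

```r
# Rebuild the example data so the snippet is self-contained
example.dat <- data.frame(
  ID = rep(1:4, each = 3),
  Time = rep(c("T1", "T2", "T3"), times = 4),
  Severity = ordered(c("Mild", "Moderate", "Mild",
                       "Severe", "Severe", "Moderate",
                       "None", NA, "None",
                       "Moderate", "Moderate", "Mild"),
                     levels = c("None", "Mild", "Moderate", "Severe"))
)

# Worst severity per ID, as in the ave call above
example.dat$Worst <- ave(example.dat$Severity, example.dat$ID,
                         FUN = function(i) max(i, na.rm = TRUE))

# Keep the last row of each ID (assumes sorting by ID, then Time)
latest <- example.dat[!duplicated(example.dat$ID, fromLast = TRUE),
                      c("ID", "Time", "Worst")]
latest
```

If every patient is known to have a T3 observation, filtering on example.dat$Time == "T3" would give the same rows.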

Here's a solution using the aggregate function in R:

example.dat <- data.frame(
  ID = c(1,1,1,2,2,2,3,3,3,4,4,4),  # patient ID numbers
  Time = c("T1", "T2", "T3", "T1", "T2", "T3",  # times at which data were collected
           "T1", "T2", "T3", "T1", "T2", "T3"),
  Severity = c("Mild", "Moderate", "Mild",  # severity of symptoms
               "Severe", "Severe", "Moderate",
               "None", NA, "None",
               "Moderate", "Moderate", "Mild")
)

# Specify the order of the factor levels
example.dat$Severity <- ordered(example.dat$Severity,
                                levels = c("None",
                                           "Mild",
                                           "Moderate",
                                           "Severe")
                                )

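# The formula interface drops rows with a missing Severity by default
# (na.action = na.omit), so max() never sees the NA for ID 3
new <- aggregate(Severity ~ ID, data = example.dat, FUN = max)
names(new)[names(new) == "Severity"] <- "Worst"
# Merge the per-ID maximum back onto the original rows (joins on the shared ID column)
(final <- merge(example.dat, new))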

Using dplyr:

library(dplyr)
res <- example.dat %>%
  group_by(ID) %>%
  mutate(Worst = Severity[which.max(Severity)])

res
#Source: local data frame [12 x 4]
# Groups: ID

#    ID Time Severity    Worst
# 1   1   T1     Mild Moderate
# 2   1   T2 Moderate Moderate
# 3   1   T3     Mild Moderate
# 4   2   T1   Severe   Severe
# 5   2   T2   Severe   Severe
# 6   2   T3 Moderate   Severe
# 7   3   T1     None     None
# 8   3   T2       NA     None
# 9   3   T3     None     None
# 10  4   T1 Moderate Moderate
# 11  4   T2 Moderate Moderate
# 12  4   T3     Mild Moderate

filter(res, Time == "T3") %>% select(-Severity)
#Source: local data frame [4 x 3]
#Groups: ID
#   ID Time    Worst
# 1  1   T3 Moderate
# 2  2   T3   Severe
# 3  3   T3     None
# 4  4   T3 Moderate

Or data.table:

library(data.table) ## 1.9.3
setDT(example.dat)[,Worst := Severity[which.max(Severity)], by=ID]    
example.dat

You can get the latest version, 1.9.3, from here. If you would rather use the CRAN version, 1.9.2, there is a small bug with factors that has to be worked around (it is fixed in 1.9.3):

library(data.table) ## 1.9.2 from CRAN
setDT(example.dat)[, Worst := as.character(Severity)]
example.dat[, Worst := Worst[which.max(Severity)], by=ID]

Assuming that the data set is already ordered by ID and Time, this will get you the final result directly:

require(data.table) ## 1.9.3
setDT(example.dat)[, list(Time=Time[.N], Worst=Severity[which.max(Severity)]), by=ID]
#    ID Time    Worst
# 1:  1   T3 Moderate
# 2:  2   T3   Severe
# 3:  3   T3     None
# 4:  4   T3 Moderate

setDT converts the data.frame to a data.table by reference. We then group by ID and take the last value of Time in each group using .N, an integer vector of length 1 holding the number of observations in that group. Similarly, which.max(Severity) gives the position of the highest Severity level in the group, which we use to pick out the corresponding Severity value.
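For readers without data.table, the same per-group logic can be sketched in base R. This is only an illustration under the same assumption that rows are sorted by ID and then Time. The names per.id and result are made up for this example, nrow(g) plays the role of .N, and which.max is applied to the integer codes of the ordered factor so that the coercion is explicit.

```r
# Rebuild the example data so the snippet is self-contained
example.dat <- data.frame(
  ID = rep(1:4, each = 3),
  Time = rep(c("T1", "T2", "T3"), times = 4),
  Severity = ordered(c("Mild", "Moderate", "Mild",
                       "Severe", "Severe", "Moderate",
                       "None", NA, "None",
                       "Moderate", "Moderate", "Mild"),
                     levels = c("None", "Mild", "Moderate", "Severe"))
)

# One row per ID: last Time in the group and worst Severity observed
per.id <- lapply(split(example.dat, example.dat$ID), function(g) {
  n <- nrow(g)  # number of rows in this group, like .N in data.table
  data.frame(ID = g$ID[1],
             Time = g$Time[n],
             # which.max skips NA and returns the first position of the maximum
             Worst = g$Severity[which.max(as.integer(g$Severity))])
})
result <- do.call(rbind, per.id)
result
```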
