简体   繁体   中英

How do I combine row entries for the same patient ID# in R while keeping other columns and NA values?

I need to combine some of the columns for these multiple IDs and can just use the values from the first ID listing for the others. For example here I just want to combine the "spending" column as well as the heart attack column to just say whether they ever had a heart attack. I then want to delete the duplicate ID#s and just keep the values from the first listing for the other columns:

df <- read.table(text = 
              "ID  Age   Gender  heartattack    spending 
               1 24 f 0 140
               2 24 m na 123
               2 24 m 1 58
               2 24 m 0 na
               3 85 f 1 170
               4 45 m na 204", header=TRUE)

What I need:

df2 <- read.table(text = 
            "ID  Age   Gender  ever_heartattack all_spending 
             1 24 f 0 140
             2 24 m 1 181
             3 85 f 1 170
             4 45 m na 204", header=TRUE)

I tried group_by with transmute() and sum() as follows:

df$heartattack = as.numeric(as.character(df$heartattack))
df$spending  = as.numeric(as.character(df$spending))

library(dplyr)
df = df %>% group_by(ID) %>% transmute(ever_heartattack = sum(heartattack, na.rm = T), all_spending = sum(spending, na.rm=T))

But this removes all the other columns! Also it turns NA values into zeros...for example I still want "NA" to be the value for patient ID#4, I don't want to change the data to say they never had a heart attack!

> print(dfa) #This doesn't at all match df2 :(
  ID ever_heartattack all_spending
1  1                0          140
2  2                1          181
3  2                1          181
4  2                1          181
5  3                1          170
6  4                0          204

Could you do this?

aggregate(
    spending ~ ID + Age + Gender, 
    data = transform(df, spending = as.numeric(as.character(spending))),
    FUN = sum)
#  ID Age Gender spending
#1  1  24      f      140
#2  3  85      f      170
#3  2  24      m      181
#4  4  45      m      204

Some comments:

  1. The thing is that when aggregating you don't give clear rules how to deal with data in additional columns that differ (like heartattack in this case). For example, for ID = 2 why do you retain heartattack = 1 instead of heartattack = na or heartattack = 0 ?

  2. Your "na" s are in fact not real NA s. That leads to spending being a factor column instead of a numeric column vector.


To exactly reproduce your expected output one can do

df %>%
    mutate(
        heartattack = as.numeric(as.character(heartattack)),
        spending = as.numeric(as.character(spending))) %>%
    group_by(ID, Age, Gender) %>%
    summarise(
        heartattack = ifelse(
            any(heartattack %in% c(0, 1)),
            max(heartattack, na.rm = T),
            NA),
        spending = sum(spending, na.rm = T))
## A tibble: 4 x 5
## Groups:   ID, Age [?]
#     ID   Age Gender heartattack spending
#  <int> <int> <fct>        <dbl>    <dbl>
#1     1    24 f                0      140
#2     2    24 m                1      181
#3     3    85 f                1      170
#4     4    45 m               NA      204

This feels a bit "hacky" on account of the rules not being clear which heartattack value to keep. In this case we

  • keep the maximum value of heartattack if heartattack contains either 0 or 1.
  • return NA if heartattack does not contain 0 or 1.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM