简体   繁体   中英

With data.table in R, how do I efficiently replace multiple column values named within a single column?

In R, I have a data table with a column ("prncl_diag") that has values of diagnoses. These diagnosis values (in the prncl_diag column) all appear as columns in the data table as well. There are ~2.5K diagnosis columns of which a subset appear as values in the "prncl_diag" column.

I want to update the diagnosis indicator columns with a 1 if its name appears in a given row of "prncl_diag" column.

That isn't explained too well, but here is a minimal working example.

dt <- data.table(heart_failure = c(0, 1, 0),
             kidney_failure = c(1, 0, 0),
             death = c(1, 1, 1), 
             prncl_diag = c('heart_failure', 'kidney_failure', 'death'))

for (i in 1:nrow(dt)) {
  name <- dt[i, prncl_diag]
  dt <- dt[i, eval(name) := 1]
}

This code works and updates row 1 of "heart_failure" to a 1, updates row 2 of "kidney_failure" to a 1, and doesn't change row 3 of "death" column as it is already 1.

However, the code is slow with a data table of 5M rows and I know I am not utilizing the structure of data.table.

Please advise for more efficient solutions. Interested to learn about R, data.table, and efficiency from the StackOverflow community.

One option is to subset by unique values in prncl_diag .

for (val in unique(dt$prncl_diag)) {
  dt[prncl_diag == val, (val) := 1]
}

That's the way I would probably go about it, especially if there is a small number of unique values in prncl_diag relative to the number of rows.

Result:

#    heart_failure kidney_failure death     prncl_diag
# 1:             1              1     1  heart_failure
# 2:             1              1     1 kidney_failure
# 3:             0              0     1          death

Here's an answer with tidyverse

library(tidyverse)
map_df(1:nrow(dt), ~dt[.x,] %>% mutate_at(vars(.$prncl_diag), function(y) ifelse(y==0,1,y)))

  heart_failure kidney_failure death     prncl_diag
1             1              1     1  heart_failure
2             1              1     1 kidney_failure
3             0              0     1          death

I think this'll achieve what you want.

> dt[, .SD
     ][, rID := 1:.N
     ][, melt(.SD, id.vars=c('prncl_diag', 'rID'))
     ][prncl_diag == variable, value := 1
     ][, dcast(.SD, prncl_diag + rID ~ variable, value.var='value')
     ][, rID := NULL
     ][]

       prncl_diag heart_failure kidney_failure death
1:          death             0              0     1
2:  heart_failure             1              1     1
3: kidney_failure             1              1     1
> 

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM