With data.table in R, how do I efficiently replace multiple column values named within a single column?

Question

In R, I have a data table with a column ("prncl_diag") that has values of diagnoses. These diagnosis values (in the prncl_diag column) all appear as columns in the data table as well. There are ~2.5K diagnosis columns of which a subset appear as values in the "prncl_diag" column.

I want to update the diagnosis indicator columns with a 1 if its name appears in a given row of "prncl_diag" column.

That isn't explained too well, but here is a minimal working example.

dt <- data.table(heart_failure = c(0, 1, 0),
             kidney_failure = c(1, 0, 0),
             death = c(1, 1, 1), 
             prncl_diag = c('heart_failure', 'kidney_failure', 'death'))

for (i in 1:nrow(dt)) {
  name <- dt[i, prncl_diag]
  dt <- dt[i, eval(name) := 1]
}

This code works and updates row 1 of "heart_failure" to a 1, updates row 2 of "kidney_failure" to a 1, and doesn't change row 3 of "death" column as it is already 1.

However, the code is slow with a data table of 5M rows and I know I am not utilizing the structure of data.table.

Please advise for more efficient solutions. Interested to learn about R, data.table, and efficiency from the StackOverflow community.

Answer 1

One option is to subset by unique values in prncl_diag .

for (val in unique(dt$prncl_diag)) {
  dt[prncl_diag == val, (val) := 1]
}

That's the way I would probably go about it, especially if there is a small number of unique values in prncl_diag relative to the number of rows.

Result:

#    heart_failure kidney_failure death     prncl_diag
# 1:             1              1     1  heart_failure
# 2:             1              1     1 kidney_failure
# 3:             0              0     1          death

Answer 2

Here's an answer with tidyverse

library(tidyverse)
map_df(1:nrow(dt), ~dt[.x,] %>% mutate_at(vars(.$prncl_diag), function(y) ifelse(y==0,1,y)))

  heart_failure kidney_failure death     prncl_diag
1             1              1     1  heart_failure
2             1              1     1 kidney_failure
3             0              0     1          death

Answer 3

I think this'll achieve what you want.

> dt[, .SD
     ][, rID := 1:.N
     ][, melt(.SD, id.vars=c('prncl_diag', 'rID'))
     ][prncl_diag == variable, value := 1
     ][, dcast(.SD, prncl_diag + rID ~ variable, value.var='value')
     ][, rID := NULL
     ][]

       prncl_diag heart_failure kidney_failure death
1:          death             0              0     1
2:  heart_failure             1              1     1
3: kidney_failure             1              1     1
>

With data.table in R, how do I efficiently replace multiple column values named within a single column?

Question

3 answers

solution1
2 ACCPTED 2017-07-20 15:32:02

solution2
1 2017-07-20 16:28:55

solution3
1 2017-07-20 21:56:43

With data.table in R, how do I efficiently replace multiple column values named within a single column?

Question

3 answers

solution1 2 ACCPTED 2017-07-20 15:32:02

solution2 1 2017-07-20 16:28:55

solution3 1 2017-07-20 21:56:43

solution1
2 ACCPTED 2017-07-20 15:32:02

solution2
1 2017-07-20 16:28:55

solution3
1 2017-07-20 21:56:43