简体   繁体   中英

Creating dummy variable from list in a data.table in R

I have a data table like this:

> x
   part        colig
 1:   PR     PT, PMDB
 2: PMDB     PT, PMDB
 3: PMDB     PT, PMDB
 4:  PDT     PT, PMDB
 5: PMDB     PT, PMDB
 6:  PFL PSDB,PFL,PTB
 7:  PPB PSDB,PFL,PTB
 8: PMDB PSDB,PFL,PTB
 9: PMDB PSDB,PFL,PTB
10:  PPB PSDB,PFL,PTB
> str(x)
Classes ‘data.table’ and 'data.frame':  10 obs. of  2 variables:
 $ part : chr  "PR" "PMDB" "PMDB" "PDT" ...
 $ colig:List of 10
  ..$ : chr "PT" "PMDB"
  ..$ : chr "PT" "PMDB"
  ..$ : chr "PT" "PMDB"
  ..$ : chr "PT" "PMDB"
  ..$ : chr "PT" "PMDB"
  ..$ : chr  "PSDB" "PFL" "PTB"
  ..$ : chr  "PSDB" "PFL" "PTB"
  ..$ : chr  "PSDB" "PFL" "PTB"
  ..$ : chr  "PSDB" "PFL" "PTB"
  ..$ : chr  "PSDB" "PFL" "PTB"
 - attr(*, ".internal.selfref")=<externalptr> 

and I want to create a dummy variable that is 1 when the first variable is contained in the second. My desired output is:

> x
    part        colig dummy
 1:   PR     PT, PMDB FALSE
 2: PMDB     PT, PMDB  TRUE
 3: PMDB     PT, PMDB  TRUE
 4:  PDT     PT, PMDB FALSE
 5: PMDB     PT, PMDB  TRUE
 6:  PFL PSDB,PFL,PTB  TRUE
 7:  PPB PSDB,PFL,PTB FALSE
 8: PMDB PSDB,PFL,PTB FALSE
 9: PMDB PSDB,PFL,PTB FALSE
10:  PPB PSDB,PFL,PTB FALSE

My problem is accessing the elements inside the list in the second column. I'm trying something like:

x[, dummy := x[,part] %in% x[, colig]]

or

x[, dummy := x[,part] %in% unlist(x[, colig])]

The two options are wrong. In the first case, the dummy is always FALSE, and in the second, the unlist() command creates a list with elements from all the lists (not only from the respective row).

I also tried with lapply (like here Creating dummy variables in R data.table ):

x[, dummy := lapply( x[,part], function(y) y %in% unlist(x[,colig]))]

which I think is correct, but I am having problems with speed because I have a lot of rows.

Is there any faster option?

Use grepl and do it by each value of "part":

x[, dummy := grepl(part, colig), by = part]

Upon second reading of OP, I'm not sure what's going on in that column - looks like some of the elements are lists and others are characters. The above will work for characters (and you can squeeze in lapply(colig, toString) somewhere to convert the list to strings).

Try with stringi , it should be fast.

library(stringi)
x$dummy = stri_detect(x[,"colig"], fixed=x[,"part"])
#    part        colig dummy
# 2    PR     PT, PMDB FALSE
# 3  PMDB     PT, PMDB  TRUE
# 4  PMDB     PT, PMDB  TRUE
# 5   PDT     PT, PMDB FALSE
# 6  PMDB     PT, PMDB  TRUE
# 7   PFL PSDB,PFL,PTB  TRUE
# 8   PPB PSDB,PFL,PTB FALSE
# 9  PMDB PSDB,PFL,PTB FALSE
# 10 PMDB PSDB,PFL,PTB FALSE
# 11  PPB PSDB,PFL,PTB FALSE

or as data.table

setDT(x)[, dummy := stri_detect(colig, fixed=part)]

Edit

If you a mixture of lists and unseparated strings as it appears you might, try something like

setDT(x)[, dummy := any(stri_detect(colig, fixed=part)), by=1:nrow(x)]

From your str(x) output, you seem to have some problems with your data. The first few rows of colig do not appear to be split. In other words, you probably mean to have the two elements "PT", "PMDB" rather than the single element "PT, PMDB". This may be part of the problem. Apply strsplit as necessary.

If your sample is representative, then doing simply

apply(x,1,function(x) x$part %in% x$colig)

where x is just a data.frame should be plenty fast. I replicated a corrected version of your x to 100000 rows and this ran in a fraction of a second.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM