I have a data table like this:
> x
part colig
1: PR PT, PMDB
2: PMDB PT, PMDB
3: PMDB PT, PMDB
4: PDT PT, PMDB
5: PMDB PT, PMDB
6: PFL PSDB,PFL,PTB
7: PPB PSDB,PFL,PTB
8: PMDB PSDB,PFL,PTB
9: PMDB PSDB,PFL,PTB
10: PPB PSDB,PFL,PTB
> str(x)
Classes ‘data.table’ and 'data.frame': 10 obs. of 2 variables:
$ part : chr "PR" "PMDB" "PMDB" "PDT" ...
$ colig:List of 10
..$ : chr "PT" "PMDB"
..$ : chr "PT" "PMDB"
..$ : chr "PT" "PMDB"
..$ : chr "PT" "PMDB"
..$ : chr "PT" "PMDB"
..$ : chr "PSDB" "PFL" "PTB"
..$ : chr "PSDB" "PFL" "PTB"
..$ : chr "PSDB" "PFL" "PTB"
..$ : chr "PSDB" "PFL" "PTB"
..$ : chr "PSDB" "PFL" "PTB"
- attr(*, ".internal.selfref")=<externalptr>
and I want to create a dummy variable that is TRUE when the value in the first column is contained in the list in the second column. My desired output is:
> x
part colig dummy
1: PR PT, PMDB FALSE
2: PMDB PT, PMDB TRUE
3: PMDB PT, PMDB TRUE
4: PDT PT, PMDB FALSE
5: PMDB PT, PMDB TRUE
6: PFL PSDB,PFL,PTB TRUE
7: PPB PSDB,PFL,PTB FALSE
8: PMDB PSDB,PFL,PTB FALSE
9: PMDB PSDB,PFL,PTB FALSE
10: PPB PSDB,PFL,PTB FALSE
My problem is accessing the elements inside the list in the second column. I'm trying something like:
x[, dummy := x[,part] %in% x[, colig]]
or
x[, dummy := x[,part] %in% unlist(x[, colig])]
Both options are wrong. In the first case, the dummy is always FALSE, and in the second, unlist() creates a single vector with the elements from all the lists (not only from the respective row).
I also tried with lapply (as in the question "Creating dummy variables in R data.table"):
x[, dummy := lapply( x[,part], function(y) y %in% unlist(x[,colig]))]
which I think is correct, but I am having problems with speed because I have a lot of rows.
Is there any faster option?
Use grepl and do it by each value of "part":
x[, dummy := grepl(part, colig), by = part]
Upon second reading of the OP, I'm not sure what's going on in that column: it looks like some of the elements are lists and others are characters. The above will work for characters (and you can squeeze in lapply(colig, toString) somewhere to convert the list to strings).
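One way the toString idea could be combined with grepl is sketched below. The sample table is rebuilt from the question, and the helper column colig_str is a hypothetical name introduced only for illustration. Note that grepl does substring matching, so a party such as "PT" would also match inside "PTB"; this sketch is only safe when no party name is contained in another.

```r
library(data.table)

# Rebuild the sample table from the question; colig is a list column.
x <- data.table(
  part  = c("PR", "PMDB", "PMDB", "PDT", "PMDB", "PFL", "PPB", "PMDB", "PMDB", "PPB"),
  colig = c(rep(list(c("PT", "PMDB")), 5), rep(list(c("PSDB", "PFL", "PTB")), 5))
)

# Flatten each list element to one comma-separated string per row
# (colig_str is a hypothetical helper column, not part of the original data).
x[, colig_str := vapply(colig, toString, character(1))]

# Within each group, part is a single value, so it can serve as the grepl pattern.
# fixed = TRUE treats the pattern as a literal string rather than a regex.
x[, dummy := grepl(part, colig_str, fixed = TRUE), by = part]
```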
Try with stringi, it should be fast.
library(stringi)
x$dummy = stri_detect(x[,"colig"], fixed=x[,"part"])
# part colig dummy
# 2 PR PT, PMDB FALSE
# 3 PMDB PT, PMDB TRUE
# 4 PMDB PT, PMDB TRUE
# 5 PDT PT, PMDB FALSE
# 6 PMDB PT, PMDB TRUE
# 7 PFL PSDB,PFL,PTB TRUE
# 8 PPB PSDB,PFL,PTB FALSE
# 9 PMDB PSDB,PFL,PTB FALSE
# 10 PMDB PSDB,PFL,PTB FALSE
# 11 PPB PSDB,PFL,PTB FALSE
or as data.table
setDT(x)[, dummy := stri_detect(colig, fixed=part)]
If you have a mixture of lists and unseparated strings, as it appears you might, try something like
setDT(x)[, dummy := any(stri_detect(colig, fixed=part)), by=1:nrow(x)]
From your str(x) output, you seem to have some problems with your data. The first few rows of colig do not appear to be split. In other words, you probably mean to have the two elements "PT", "PMDB" rather than the single element "PT, PMDB". This may be part of the problem. Apply strsplit as necessary.
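As a minimal sketch of that clean-up, assuming the coalitions arrive as unsplit comma-separated strings (the four-row table and the name raw are made up for illustration), strsplit can build the list column and mapply can then test exact membership row by row. Unlike a substring search, "PT" is not counted as a member of a coalition that only contains "PTB".

```r
library(data.table)

# Hypothetical unsplit input: one comma-separated string per row.
raw <- c("PT, PMDB", "PT, PMDB", "PSDB,PFL,PTB", "PSDB,PFL,PTB")

# Split on commas, dropping any spaces after them, to get one character vector per row.
x <- data.table(
  part  = c("PR", "PMDB", "PFL", "PT"),
  colig = strsplit(raw, ",\\s*")
)

# Pair each part with its own row's coalition and test exact membership.
x[, dummy := mapply(function(p, cl) p %in% cl, part, colig, USE.NAMES = FALSE)]
```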
If your sample is representative, then doing simply
apply(x,1,function(x) x$part %in% x$colig)
where x is just a data.frame should be plenty fast. I replicated a corrected version of your x to 100000 rows and this ran in a fraction of a second.
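A rough way to try this at scale yourself is sketched below. It swaps the apply call for a row-wise mapply on a data.table, which is a substitution on my part rather than the exact code timed above, and the names small, big and timing are illustrative. Timings depend on the machine, so none are quoted here.

```r
library(data.table)

# The ten sample rows from the question, with colig as a list column.
small <- data.table(
  part  = c("PR", "PMDB", "PMDB", "PDT", "PMDB", "PFL", "PPB", "PMDB", "PMDB", "PPB"),
  colig = c(rep(list(c("PT", "PMDB")), 5), rep(list(c("PSDB", "PFL", "PTB")), 5))
)

# Repeat the ten rows 10000 times to get 100000 rows.
big <- small[rep(seq_len(nrow(small)), 10000)]

# Time the row-wise exact-membership test.
timing <- system.time(
  big[, dummy := mapply(function(p, cl) p %in% cl, part, colig, USE.NAMES = FALSE)]
)
print(timing)
```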