I have the following R data.table, which has one list column whose elements are numeric vectors:
library(data.table)
dt = data.table(
numericcol = rep(42, 8),
listcol = list(c(1, 22, 3), 6, 1, 12, c(5, 6, 1123), 3, 42, 1)
)
> dt
numericcol listcol
1: 42 1,22, 3
2: 42 6
3: 42 1
4: 42 12
5: 42 5, 6,1123
6: 42 3
7: 42 42
8: 42 1
I would like to create two columns: (1) a column that shows the size of each list element and (2) a boolean column, TRUE if 1 is an element, FALSE otherwise.
Here is what the output should look like:
numericcol listcol size ones
1: 42 1,22, 3 3 TRUE
2: 42 6 1 FALSE
3: 42 1 1 TRUE
4: 42 12 1 FALSE
5: 42 5, 6,1123 3 FALSE
6: 42 3 1 FALSE
7: 42 42 1 FALSE
8: 42 1 1 TRUE
I know how to create the column size, i.e.
dt[, size:=sapply(dt$listcol, length)]
I assumed I could check whether a row contains 1 when it holds only a single number, i.e.
dt[, ones := dt$listcol[dt$listcol == 1] ]
This approach is wrong, however. I don't know how to check whether a row of the list column that holds several numbers contains a 1.
What is an efficient way to do this?
dt[, o := sapply(listcol, function(x) 1 %in% x)]
dt
# numericcol listcol o
# 1: 42 1,22, 3 TRUE
# 2: 42 6 FALSE
# 3: 42 1 TRUE
# 4: 42 12 FALSE
# 5: 42 5, 6,1123 FALSE
# 6: 42 3 FALSE
# 7: 42 42 FALSE
# 8: 42 1 TRUE
We can create 'size' by taking the lengths of 'listcol', then loop through 'listcol', check whether 1 is %in% each of the vectors, and assign the result to 'ones'
dt[, size := lengths(listcol)
][, ones := unlist(lapply(listcol, function(x) 1 %in% x))]
dt
# numericcol listcol size ones
#1: 42 1,22, 3 3 TRUE
#2: 42 6 1 FALSE
#3: 42 1 1 TRUE
#4: 42 12 1 FALSE
#5: 42 5, 6,1123 3 FALSE
#6: 42 3 1 FALSE
#7: 42 42 1 FALSE
#8: 42 1 1 TRUE
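As a type-safe variant of the loop above, here is a minimal sketch that swaps unlist(lapply(...)) for base R's vapply(), which requires every result to be a single logical value and still returns a logical vector when the input list is empty. It rebuilds the question's dt so that it runs on its own; it is an alternative spelling, not a faster method.

```r
library(data.table)

# Same data as in the question
dt <- data.table(
  numericcol = rep(42, 8),
  listcol = list(c(1, 22, 3), 6, 1, 12, c(5, 6, 1123), 3, 42, 1)
)

# lengths() gives the size of each list element as an integer vector
dt[, size := lengths(listcol)]

# vapply() checks that each call returns exactly one logical value
dt[, ones := vapply(listcol, function(x) 1 %in% x, logical(1))]
```

Because the result type is declared up front, a malformed element raises an error instead of silently producing a list column.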
Another option is map_lgl from purrr, which is a bit more efficient (see the timings below)
library(purrr)
dt[, ones := map_lgl(listcol, `%in%`, x = 1)]
And if parallel processing is an option
library(furrr)
plan(multisession) # plan(multiprocess) is deprecated in recent versions of future
dt[, ones := future_map_lgl(listcol, `%in%`, x = 1)]
Also, if we intend to do this with the tidyverse
library(dplyr)
dt %>%
  mutate(size = lengths(listcol),
         ones = map_lgl(listcol, `%in%`, x = 1)) # map_lgl gives a logical vector; map would give a list
set.seed(24)
dt1 <- data.table( numericcol = rep(42, 8000000),
listcol = rep(list(c(1, 22, 3), 6, 1, 12, c(5, 6, 1123), 3, 42, 1), 1e6))
dt2 <- copy(dt1)
#timing for creating the size column
system.time({
dt1[, size := lengths(listcol)]
})
# user system elapsed
# 0.3 0.0 0.3
system.time({
dt2[, size:= sapply(listcol, length)]
})
# user system elapsed
# 6.45 0.28 6.97
#timing for creating the second column
system.time({
dt1[, ones := unlist(lapply(listcol, function(x) 1 %in% x))]
})
# user system elapsed
# 15.12 0.26 16.42
system.time({
dt2[, ones := sapply(listcol, function(x) any(1 %in% x))]
})
# user system elapsed
# 17.00 0.04 17.52
system.time({
dt2[, one := map_lgl(listcol, `%in%`, x = 1)]
})
# user system elapsed
# 10.92 0.00 11.25