
data.table assignment operator with lists in R

I have a data.table with a name column, and I'm trying to extract part of each name using a regular expression. The most obvious way to do this seems to be the := operator, since I'm assigning the extracted string as the actual name of the data. When I do, though, the function isn't applied the way I would expect. I'm not sure whether this is intentional: is there a reason it behaves this way, or is it a bug?

library(data.table)
dt <- data.table(name = c('foo123', 'bar234'))

Searching for the desired expression in a simple character vector behaves as expected:

name <- dt[1, name]
pattern <- '(.*?)\\d+'
regmatches(name, regexec(pattern, name))
[[1]]
[1] "foo123" "foo"  

I can easily subset this to get what I want:

regmatches(name, regexec(pattern, name))[[1]][2]
[1] "foo"

However, I run into issues when I try to apply this to the entire data.table:

dt[, name_final := regmatches(name, regexec(pattern, name))[[1]][2]]
dt
    name name_final
1: foo123        foo
2: bar234        foo

I don't know how data.table works internally, but my guess is that the function is applied to the entire name column first, and the result is then somehow coerced into a vector and assigned to the new name_final column. What I expected was for it to be applied row by row. I can emulate that behavior by adding a dummy id column:

dt[, id := seq_along(name)]
dt[, name_final := regmatches(name, regexec(pattern, name))[[1]][2], by = list(id)]
dt
    name name_final id
1: foo123        foo  1
2: bar234        bar  2

Is there a reason that this isn't the default behavior? If so, I would guess that it had to do with columns being atomic to the data.table rather than the rows, but I'd like to understand what's going on there.

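Pretty much nothing in R runs on a row-by-row basis. It's almost always better to work with a whole column at a time, so you can assume that the entire column vector is passed to your function. In your code, regmatches() returns a list with one element per name, and [[1]][2] takes the second item of the first element only; that single value ("foo") is then recycled into every row. Here's a way to extract the second element of each item in the regmatches list instead: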

dt[, name_final := sapply(regmatches(name, regexec(pattern, name)), `[`, 2)]
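To see why the original assignment filled every row with "foo", it may help to inspect the intermediate values outside of data.table. This is a minimal base-R sketch using the same two names and pattern as the question; the variable names m, first_only and per_row are only illustrative.

```r
name <- c('foo123', 'bar234')
pattern <- '(.*?)\\d+'

# regmatches() returns a list with one element per input string
m <- regmatches(name, regexec(pattern, name))
length(m)

# [[1]][2] only ever looks at the first string's match, so it yields
# a single value; := then recycles that value down the whole column
first_only <- m[[1]][2]
first_only

# sapply() applies `[` to every list element, giving one value per row
per_row <- sapply(m, `[`, 2)
per_row
```

Because first_only has length 1 while the column has two rows, the assignment repeats it rather than raising an error, which is why no warning appears.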

Functions like sapply() or Vectorize() can "fake" a per-row type call for functions that aren't meant to be run on a vector/list of data at a time.
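As a rough illustration of those options, the sketch below wraps a scalar helper with Vectorize(), uses vapply() as a stricter variant of sapply(), and, for comparison, uses sub(), which is already vectorised. The helper name extract_prefix and the new column names are hypothetical and not part of the original question.

```r
library(data.table)

dt <- data.table(name = c('foo123', 'bar234'))
pattern <- '(.*?)\\d+'

# Hypothetical helper: handles ONE string and returns its first capture group
extract_prefix <- function(x, pattern) {
  regmatches(x, regexec(pattern, x))[[1]][2]
}

# Vectorize() wraps the scalar helper so it is called once per element of x
extract_prefix_v <- Vectorize(extract_prefix, vectorize.args = "x",
                              USE.NAMES = FALSE)
dt[, via_vectorize := extract_prefix_v(name, pattern)]

# vapply() works like sapply() but checks that each result is one string
dt[, via_vapply := vapply(regmatches(name, regexec(pattern, name)),
                          function(m) m[2], character(1))]

# sub() operates on the whole column at once; note that it returns the
# input unchanged (not NA) for any string the pattern does not match
dt[, via_sub := sub(pattern, '\\1', name)]

dt
```

For the two names in the question all three columns should agree; they can differ for names that do not match the pattern, so which one fits depends on how non-matches should be handled.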
