data.table assignment operator with lists in R

I have a data.table containing a name column, and I'm trying to extract the part of each name that matches a regular expression. The most obvious way to do it in this case is with the := operator, as I'm assigning this extracted string as the actual name of the data. In doing so, I find that this doesn't apply the function in the way that I would expect. I'm not sure whether this is intentional, and I was wondering if there's a reason it behaves this way or if it's a bug.

library(data.table)
dt <- data.table(name = c('foo123', 'bar234'))

Searching for the desired expression in a simple character vector behaves as expected:

name <- dt[1, name]
pattern <- '(.*?)\\d+'
regmatches(name, regexec(pattern, name))
[[1]]
[1] "foo123" "foo"  

I can easily subset this to get what I want:

regmatches(name, regexec(pattern, name))[[1]][2]
[1] "foo"

However, I run into issues when I try to apply this to the entire data.table:

dt[, name_final := regmatches(name, regexec(pattern, name))[[1]][2]]
dt
    name name_final
1: foo123        foo
2: bar234        foo

I don't know how data.table works internally, but I would guess that the function is applied to the entire name column first, and the result is then somehow coerced into a vector and assigned to the new name_final column. However, the behavior I would expect here is row by row. I can emulate this behavior by adding a dummy id column:

dt[, id := seq_along(name)]
dt[, name_final := regmatches(name, regexec(pattern, name))[[1]][2], by = list(id)]
dt
    name name_final id
1: foo123        foo  1
2: bar234        bar  2

Is there a reason that this isn't the default behavior? If so, I would guess that it has to do with columns, rather than rows, being the atomic unit of a data.table, but I'd like to understand what's going on there.

Pretty much nothing in R runs on a row-by-row basis. It is generally better to work with whole columns of data at a time, so you can assume that the entire column vector of values will be passed as an argument to your function. In your call, regmatches() returned a list with one element per row, [[1]][2] picked out only "foo" from the first element, and := recycled that single value down the whole column. Here's a way to extract the second element for each item in the regmatches list:

dt[, name_final := sapply(regmatches(name, regexec(pattern, name)), `[`, 2)]
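
To see where the repeated "foo" came from, it may help to look at the intermediate result on a plain character vector. The following is only a minimal sketch in base R; the variable names matches, first_only and per_element are made up for illustration and are not from the original post.

```r
name <- c('foo123', 'bar234')
pattern <- '(.*?)\\d+'

# regmatches() returns a list with one character vector per input element:
# the full match first, then the capture group
matches <- regmatches(name, regexec(pattern, name))

# [[1]][2] only ever looks at the first list element, so it yields a single
# string; := would then recycle that length-1 value down the column
first_only <- matches[[1]][2]

# sapply() applies `[` with index 2 to every list element instead,
# giving one value per input element
per_element <- sapply(matches, `[`, 2)

print(first_only)
print(per_element)
```

Because per_element has the same length as the column, assigning it with := fills each row with its own value rather than a recycled one.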

Functions like sapply() or Vectorize() can "fake" a per-row call for functions that aren't written to handle a whole vector or list of data at once.
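
As a rough illustration of the Vectorize() route, the sketch below wraps the question's single-string expression in a helper and vectorizes it over its first argument. The names extract_prefix and extract_prefix_v are hypothetical and chosen only for this example.

```r
# Hypothetical helper: written for one string at a time, as in the question
extract_prefix <- function(x, pattern) {
  regmatches(x, regexec(pattern, x))[[1]][2]
}

# Vectorize() builds a wrapper that calls the helper once per element of `x`
# (via mapply); `pattern` is passed through unchanged to every call
extract_prefix_v <- Vectorize(extract_prefix, vectorize.args = "x",
                              USE.NAMES = FALSE)

name <- c('foo123', 'bar234')
pattern <- '(.*?)\\d+'

result <- extract_prefix_v(name, pattern)
print(result)
```

Inside the data.table this would presumably be used as dt[, name_final := extract_prefix_v(name, pattern)]. Note that this still loops over the elements internally, so it is a convenience rather than a speed-up compared with the sapply() version.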
