data.table assignment operator with lists in R

Question

I have a data.table containing a name column, and I'm trying to extract the part of each name that matches a regular expression. The most obvious way to do this is with the := operator, since I'm assigning the extracted string as the actual name of the data. When I do, the function isn't applied the way I would expect. I'm not sure whether this is intentional, and I was wondering whether there's a reason for the behavior or whether it's a bug.

library(data.table)
dt <- data.table(name = c('foo123', 'bar234'))

Searching for the desired expression in a simple character vector behaves as expected:

name <- dt[1, name]
pattern <- '(.*?)\\d+'
regmatches(name, regexec(pattern, name))
[[1]]
[1] "foo123" "foo"

I can easily subset this to get what I want:

regmatches(name, regexec(pattern, name))[[1]][2]
[1] "foo"

However, I run into issues when I try to apply this to the entire data.table:

dt[, name_final := regmatches(name, regexec(pattern, name))[[1]][2]]
dt
    name name_final
1: foo123        foo
2: bar234        foo

I don't know how data.table works internally, but I would guess that the function is applied to the entire name column first, and that the result is then somehow coerced into a vector and assigned to the new name_final column. The behavior I expected, however, was row by row. I can emulate this behavior by adding a dummy id column:

dt[, id := seq_along(name)]
dt[, name_final := regmatches(name, regexec(pattern, name))[[1]][2], by = list(id)]
dt
    name name_final id
1: foo123        foo  1
2: bar234        bar  2
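
As an aside, the dummy id column is not strictly needed. A minimal sketch, assuming the goal is simply one group per row: data.table accepts an expression in by, so seq_len(nrow(dt)) can stand in for the id column. Per-row grouping is slow on large tables, so this is an illustration rather than a recommendation.

```r
library(data.table)

dt <- data.table(name = c('foo123', 'bar234'))
pattern <- '(.*?)\\d+'

# Group by row number so that `name` has length one inside each group;
# [[1]][2] then refers to the current row's capture group.
dt[, name_final := regmatches(name, regexec(pattern, name))[[1]][2],
   by = seq_len(nrow(dt))]

dt$name_final
```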

Is there a reason that this isn't the default behavior? If so, I would guess it has to do with columns, rather than rows, being the atomic unit of a data.table, but I'd like to understand what's going on there.

Answer 1

Pretty much nothing in R runs on a row-by-row basis. It's almost always better to work with whole columns of data at a time, so you can assume that the entire column vector of values will be passed as the argument to your function. Here's a way to extract the second element of each item in the regmatches list:
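
To see why the ungrouped call in the question fills every row with "foo", it can help to inspect the intermediate values on a plain character vector. The sketch below uses only base R; the variable names x and m are illustrative and do not come from the question.

```r
x <- c('foo123', 'bar234')   # stands in for the whole name column
pattern <- '(.*?)\\d+'

m <- regmatches(x, regexec(pattern, x))
length(m)       # one list element per input string
m[[1]]          # full match and capture group for the first string only
m[[1]][2]       # a single string taken from the first element
```

Because m[[1]][2] has length one, := recycles it to every row, which is where the repeated "foo" comes from.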

dt[, name_final := sapply(regmatches(name, regexec(pattern, name)), `[`, 2)]

Functions like sapply() or Vectorize() can "fake" a per-row call for functions that aren't meant to be run on a whole vector or list of data at a time.
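
For illustration, here are two hedged variants of the same idea in base R. vapply() is a stricter relative of sapply() that checks each result is a single string, and extract_prefix is a hypothetical helper name introduced only for this sketch.

```r
x <- c('foo123', 'bar234')
pattern <- '(.*?)\\d+'

# Variant 1: vapply() pulls element 2 out of every list item and
# errors if any result is not a length-one character value.
via_vapply <- vapply(regmatches(x, regexec(pattern, x)), `[`, character(1), 2)

# Variant 2: wrap the single-string logic and let Vectorize() loop over inputs.
# extract_prefix is a hypothetical name, not part of any package.
extract_prefix <- Vectorize(
  function(s) regmatches(s, regexec(pattern, s))[[1]][2],
  USE.NAMES = FALSE
)
via_vectorize <- extract_prefix(x)
```

Either result can be assigned with := in the same way as the sapply() version. Strings that do not match the pattern come back as NA in both variants.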

Solution 1 (score 3, accepted, 2015-01-22 16:11:53)
