R: pass data.frame by reference to a function

Question

I pass a data.frame as a parameter to a function that wants to alter the data inside:

x <- data.frame(value=c(1,2,3,4))
f <- function(d){
  for(i in 1:nrow(d)) {
    if(d$value[i] %% 2 == 0){
      d$value[i] <-0
    }
  }
  print(d)
}

When I execute f(x), I can see that the data.frame inside the function gets modified:

> f(x)
  value
1     1
2     0
3     3
4     0

However, the original data.frame I passed is unmodified; printing x afterwards still shows the values 1, 2, 3, 4.

Usually I have overcome this by returning the modified one:

f <- function(d){
  for(i in 1:nrow(d)) {
    if(d$value[i] %% 2 == 0){
      d$value[i] <-0
    }
  }
  d
}

And then I call the function and reassign the result:

> x <- f(x)
> x
  value
1     1
2     0
3     3
4     0

However, I wonder what the effect of this behaviour is with a very large data.frame. Is a new one created for the function execution? What is the R-ish way of doing this?

Is there a way to modify the original one without creating another one in memory?

Answer 1

Actually, in R (almost) every modification is performed on a copy of the previous data (copy-on-write behavior).
So, for example, inside your function, when you do d$value[i] <-0, some copies are actually created. You usually won't notice this since it's well optimized, but you can trace it with the tracemem function.
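As a rough illustration of the tracing mentioned above, here is a minimal sketch. The function name zero_second and the variable y are made up for this example, and tracemem only works when R was built with memory profiling support, which the sketch checks first:

```r
x <- data.frame(value = c(1, 2, 3, 4))

# Hypothetical helper: modifies its local argument and returns it
zero_second <- function(d) {
  d$value[2] <- 0  # the write is what forces R to duplicate the data
  d
}

# tracemem is only available if R was compiled with memory profiling
if (capabilities("profmem")) {
  tracemem(x)          # start reporting duplications of x
  y <- zero_second(x)  # a tracemem message is printed when a copy is made
  untracemem(x)        # stop tracing
} else {
  y <- zero_second(x)
}

print(x)  # the original is left untouched
print(y)  # the returned copy carries the change
```

The exact tracemem messages (addresses, number of copies) depend on the R version, so they are not shown here.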

That being said, if your data.frame is not really big, you can stick with your function returning the modified object, since it's just one more copy after all.

But if your dataset is really big and making a copy every time would be really expensive, you can use data.table, which allows in-place modifications, e.g.:

library(data.table)
d <- data.table(value=c(1,2,3,4))
f <- function(d){
  for(i in 1:nrow(d)) {
    if(d$value[i] %% 2 == 0){
      set(d,i,1L,0) # special function of data.table (see also ?`:=` )
    }
  }
  print(d)
}

f(d)
print(d)

# results :
> f(d)
   value
1:     1
2:     0
3:     3
4:     0
> 
> print(d)
   value
1:     1
2:     0
3:     3
4:     0

NB

In this specific case, the loop can be replaced with a "vectorized" and more efficient version, e.g.:

d[d$value %% 2 == 0,'value'] <- 0

but maybe your real loop code is much more convoluted and cannot be vectorized easily.
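For the data.table case, the vectorized update can also be done by reference with the `:=` operator that the code comment in the answer points to. This is only a sketch and assumes the data.table package is installed; the function name zero_evens is made up for this example:

```r
library(data.table)

d <- data.table(value = c(1, 2, 3, 4))

# Hypothetical helper: zeroes the even values of the table it is given.
# `:=` updates the column by reference, so the caller's table changes too.
zero_evens <- function(dt) {
  dt[value %% 2 == 0, value := 0]
  invisible(dt)
}

zero_evens(d)  # no reassignment needed
print(d)       # value is now 1, 0, 3, 0
```

Because the update happens by reference, the function is called for its side effect, so returning the table invisibly is just a convenience for chaining.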

Answer 1: 8 (accepted), 2015-10-17 12:55:55