Call data.frame columns inside of R functions?

What is the proper way to do this?

I have a function that works well on its own given a series of inputs, and I'd like to apply it to a large dataset rather than to single values by looping through the data row by row. I have tried to update the function to take data.frame columns rather than vector values, but have been unsuccessful.

A simple example of this is:

Let's say I have a data.frame with 4 columns: data$id, data$height, data$weight, data$gender. I want to write a function that will loop over each row (using apply) and calculate BMI (kg/m^2). I know this would be easy with dplyr, but I would like to learn how to do it without resorting to external packages, and I can't find a clear answer on how to properly reference the columns within the function.

Apologies in advance if this is a duplicate. I've been searching Stack Overflow pretty thoroughly in hopes of finding an existing example.

I think this is what you're looking for. The easiest way to refer to columns of a data frame programmatically is to use quoted column names. In principle, what you're doing is this:

data[, "weight"] / data[, "height"]^2

but inside a function you might want to let the user specify that the height or weight column is named differently, so you can write your function like this:

add_bmi = function(data, height_col = "height", weight_col = "weight") {
    # BMI is weight / height^2, so the height column must be squared
    data$bmi = data[, weight_col] / data[, height_col]^2
    return(data)
}

This function will assume that the columns to use are named "height" and "weight" by default, but the user can specify other names if necessary. You could write a similar solution using column indices instead, but names tend to be easier to debug.
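As a quick check of the defaults and the override, here is a minimal sketch. The data frames `people` and `patients` and the column names `ht_m` and `wt_kg` are made up for illustration, and the function is restated (with the height squared, per kg/m^2) only so the block runs on its own.

```r
# Restated so the block is self-contained; height is squared, per kg/m^2.
add_bmi <- function(data, height_col = "height", weight_col = "weight") {
  data$bmi <- data[[weight_col]] / data[[height_col]]^2
  data
}

# Hypothetical data that uses the default column names
people <- data.frame(id = 1:2, height = c(1.60, 2.00), weight = c(64, 80))
people <- add_bmi(people)
people$bmi
# [1] 25 20

# Hypothetical data whose columns are named differently
patients <- data.frame(ht_m = c(1.60, 2.00), wt_kg = c(64, 80))
patients <- add_bmi(patients, height_col = "ht_m", weight_col = "wt_kg")
patients$bmi
# [1] 25 20
```

The sketch uses `[[` rather than `[, ]` to pull out each column; for an ordinary data.frame both return the same vector.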

Functions this simple are rarely useful. If you're calculating BMI for a lot of datasets, it may be worth keeping this function around, but since it is a one-liner in base R you probably don't need it.

my_data$BMI = with(my_data, weight / height^2)

One note is that using column names stored in variables means you can't use $. This is the price we pay for making things more programmatic, and it's a good habit to form for such applications. See fortunes::fortune(343):

Sooner or later most R beginners are bitten by this all too convenient shortcut. As an R newbie, think of R as your bank account: overuse of $-extraction can lead to undesirable consequences. It's best to acquire the '[[' and '[' habit early.

-- Peter Ehlers (about the use of $-extraction) R-help (March 2013)
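To make the point about $ concrete, here is a small sketch; the data frame `my_data` and the variable `col` are invented for the example.

```r
# Hypothetical data and variable names, for illustration only
my_data <- data.frame(height = c(1.5, 2.0), weight = c(45, 80))
col <- "weight"

# `$` looks for a column literally named "col", finds none, and returns NULL
is.null(my_data$col)
# [1] TRUE

# `[[` evaluates `col` first, so it extracts the "weight" column
my_data[[col]]
# [1] 45 80
```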

For fancier usage like dplyr's, where you don't have to quote column names (and can evaluate expressions), the lazyeval package makes things relatively painless and has very nice vignettes.

The base function with can be used to do some lazy evaluation, e.g.,

with(mtcars, plot(disp, mpg))
# sometimes with is nice
plot(mtcars$disp, mtcars$mpg)

but with is best used interactively and in straightforward scripts. If you get into writing programmatic production code (e.g., your own R package), it's safer to avoid non-standard evaluation. See, for example, the warning in ?subset, another base R function that uses non-standard evaluation.
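As a rough illustration of the difference, the sketch below filters mtcars once with subset() and once with a small helper that takes the column name as a string. The helper `rows_above` is a hypothetical name, not part of base R.

```r
# Hypothetical helper: keep the rows where a named column exceeds a threshold.
# It uses standard evaluation only, so the column name is an ordinary string.
rows_above <- function(data, col, threshold) {
  keep <- data[[col]] > threshold
  data[which(keep), , drop = FALSE]  # which() drops rows where the test is NA
}

# Interactive style, relying on non-standard evaluation
nse <- subset(mtcars, mpg > 30)

# Programmatic style: the column name can come from a variable
target <- "mpg"
se <- rows_above(mtcars, target, 30)

# Both approaches select the same cars
identical(rownames(nse), rownames(se))
# [1] TRUE
```

Because `rows_above` never evaluates an unquoted expression, it behaves the same whether it is called at the console or from inside another function.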

Generally speaking, functions should not know about more than they need to know about. If you write a function that requires a data.frame when it is not essential that the input be provided in a data.frame, you are making your function more restrictive than it needs to be.

The correct way to write this function is as follows:

bmi <- function(height,weight) weight/height^2;

This will allow you to compute a vector of BMI values from a vector of height values and a vector of weight values, since both / and ^ are vectorized operations. So, for example, if you had two loose vectors of height and weight, you could call it as follows:

set.seed(1);
N <- 5;
height <- rnorm(N,1.7,0.2);
weight <- rnorm(N,65,4);
BMI <- bmi(height,weight);
height; weight; BMI;
## [1] 1.574709 1.736729 1.532874 2.019056 1.765902
## [1] 61.71813 66.94972 67.95330 67.30313 63.77845
## [1] 24.88926 22.19652 28.91995 16.50967 20.45224

And if you had your inputs contained in a data.frame, you would be able to do this:

set.seed(2);
N <- 5;
df <- data.frame(id=1:N, height=rnorm(N,1.7,0.2), weight=rnorm(N,65,4), gender=sample(c('M','F'),N,replace=T) );
df$BMI <- bmi(df$height,df$weight);
df;
##   id   height   weight gender      BMI
## 1  1 1.520617 65.52968      F 28.33990
## 2  2 1.736970 67.83182      M 22.48272
## 3  3 2.017569 64.04121      F 15.73268
## 4  4 1.473925 72.93790      M 33.57396
## 5  5 1.683950 64.44485      M 22.72637
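If a row-by-row loop is still wanted, as in the question, the same bmi function can be reused unchanged. The following is a sketch with made-up values: mapply() calls bmi once per row and should give the same numbers as the single vectorized call, though it will generally be slower on large data.

```r
bmi <- function(height, weight) weight / height^2

# Made-up values for illustration
df <- data.frame(id = 1:3,
                 height = c(1.60, 1.75, 2.00),
                 weight = c(64, 70, 80),
                 gender = c("F", "M", "M"))

# Vectorized: one call handles every row
vectorized <- bmi(df$height, df$weight)

# Row by row: mapply() calls bmi() once per (height, weight) pair
row_wise <- mapply(bmi, df$height, df$weight)

all.equal(vectorized, row_wise)
# [1] TRUE
```

Note that apply(df, 1, ...) is a poor fit for this data: apply() first converts the data frame to a matrix, and the character gender column turns every value into a string.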

I'm providing this answer because I was not able to find it on SO and banged my head against the wall trying to figure out why the function in my R package was treating my new column as an object and not as a data.frame column.

If a function takes in a data.frame and, within the function, you are adding and transforming additional column(s), the way to do so is as follows:

example_func <- function(df, value, i) {
  # To add a new column
  df[["New.Column"]] <- value

  # To get the ith value of that column
  df[[i, "New.Column"]]

  # To subset the df using some conditional logic on that column
  # (the trailing comma selects rows, which a plain data.frame requires)
  df[df[["New.Column"]] == value, ]

  # To sort on that column in descending order
  setorderv(df, "New.Column", -1)
}

Note this requires library(data.table), which provides setorderv().
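Since setorderv() is a data.table function, readers who want to stay in base R could approximate the same four steps as sketched below. The function name `example_base`, the data frame `toy`, and the returned list are assumptions made for the example, not part of the answer above.

```r
# Hypothetical function and data, base R only
example_base <- function(df, value, i) {
  # Add a new column
  df[["New.Column"]] <- value

  # Get the ith value of that column
  ith <- df[["New.Column"]][i]

  # Subset the rows using a condition on that column
  matches <- df[df[["New.Column"]] == ith, , drop = FALSE]

  # Sort on that column in descending order
  sorted <- df[order(df[["New.Column"]], decreasing = TRUE), , drop = FALSE]

  list(ith = ith, matches = matches, sorted = sorted)
}

toy <- data.frame(id = 1:4)
res <- example_base(toy, value = c(10, 30, 20, 30), i = 2)
res$ith
# [1] 30
res$matches$id
# [1] 2 4
res$sorted[["New.Column"]]
# [1] 30 30 20 10
```

Unlike setorderv(), which reorders by reference, order() returns a new index, so the sorted result has to be assigned or returned.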
