How to apply a function to each element of a data.frame?

I want to convert numeric values into a factor: if a value is below -2 the level should be "down", if it is above 2 the level should be "up", and anything in between should be "no_change".

So far I thought about creating a function:

classifier <- function(x){
    if (x >= 2){
      return(as.factor("up"))
    }else if (x <= -2){
      return(as.factor("down"))
    }else {
      return(as.factor("no_change"))
    }
}

I could make it iterate (with a for loop) over the input and return a list, so that I could use it with apply.

I want to apply this function to all cells of the data.frame. How can I do it?

Mock data (runif(15, min=-5, max=5)):

c(1.11004611710086, -1.86842617811635, 1.72159335808828, -2.68788822228089, 
2.72551498375833, 3.67290901951492, -4.00984475389123, -2.39582793787122, 
4.22395745059475, -0.360892189200968, 1.35027756914496, 2.89919016882777, 
-0.158692332915962, -0.950306688901037, 3.39141107397154)

Using DF <- iris[-5] as sample data, you can use cut, as I suggested in the comments.

Try:

DF[] <- lapply(DF, cut, c(-Inf, -2, 2, Inf), c("down", "no_change", "up"))

head(DF)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1           up          up    no_change   no_change
## 2           up          up    no_change   no_change
## 3           up          up    no_change   no_change
## 4           up          up    no_change   no_change
## 5           up          up    no_change   no_change
## 6           up          up    no_change   no_change

tail(DF)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width
## 145           up          up           up          up
## 146           up          up           up          up
## 147           up          up           up   no_change
## 148           up          up           up   no_change
## 149           up          up           up          up
## 150           up          up           up   no_change
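
The empty brackets in DF[] <- are what keep the result a data.frame: lapply() on its own returns a plain list, while assigning into DF[] replaces the columns and preserves the names and class. A minimal sketch on a small made-up data.frame (the name toy and its values are assumptions for illustration, not part of the original answer):

```r
# Hypothetical toy data; not from the original answer
toy <- data.frame(a = c(-3, 0, 3), b = c(1, 4, -5))
breaks <- c(-Inf, -2, 2, Inf)
labs   <- c("down", "no_change", "up")

# lapply() alone returns a list of factors
as_list <- lapply(toy, cut, breaks, labs)
class(as_list)   # "list"

# Assigning into toy[] keeps the data.frame and its column names
toy[] <- lapply(toy, cut, breaks, labs)
class(toy)       # "data.frame"
toy
```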

Or, with RHertel's "mock_data":

cut(mock_data, c(-Inf, -2, 2, Inf), c("down", "no_change", "up"))
##  [1] no_change no_change no_change down      up        up        down     
##  [8] down      up        no_change no_change up        no_change no_change
## [15] up       
## Levels: down no_change up
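
One detail worth knowing about the breaks: cut() closes its intervals on the right by default, so a value of exactly -2 falls into "down" and a value of exactly 2 falls into "no_change". The sketch below shows this with two hand-picked boundary values (the vector edge is an assumption for illustration) and how right = FALSE shifts them:

```r
breaks <- c(-Inf, -2, 2, Inf)
labs   <- c("down", "no_change", "up")

# Hypothetical boundary values, chosen to sit exactly on the breaks
edge <- c(-2, 2)

# Default right = TRUE: intervals are (-Inf, -2], (-2, 2], (2, Inf]
default_cut <- cut(edge, breaks, labs)
as.character(default_cut)   # "down" "no_change"

# right = FALSE: intervals are [-Inf, -2), [-2, 2), [2, Inf)
left_cut <- cut(edge, breaks, labs, right = FALSE)
as.character(left_cut)      # "no_change" "up"
```

With continuous data such as runif() output, exact ties on a break are very unlikely, so the choice rarely matters in practice.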

Benchmarks

As I suggested in the comments, RHertel's approach is likely to be the most efficient. It uses straightforward subsetting (which is fast) and factor (which is also generally fast).

On data of the size you describe, you will definitely notice the difference:

set.seed(1)
nrow = 20000
ncol = 1000
x <- as.data.frame(matrix(runif(nrow * ncol, min=-5, max=5), ncol = ncol))

factorize <- function(invec) {
  factorized <- rep("no_change", length(invec))
  factorized[invec > 2]  <- "up"
  factorized[invec < -2]  <- "down"
  factor(factorized, c("down", "no_change", "up"))
}

RHfun <- function(indf = x) {
  indf[] <- lapply(indf, factorize)
  indf
}

AMfun <- function(DF = x) {
  DF[] <- lapply(DF, cut, c(-Inf, -2, 2, Inf), c("down", "no_change", "up"))
  DF
}

library(microbenchmark)
microbenchmark(AMfun(), RHfun(), times = 10)
# Unit: seconds
#     expr      min       lq     mean   median       uq       max neval
#  AMfun() 7.501814 8.015532 8.852863 8.731638 9.660191 10.198983    10
#  RHfun() 1.437696 1.485791 1.723402 1.574507 1.637139  2.528574    10
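
Before trusting a timing comparison, it is worth confirming that both approaches produce the same labels. The sketch below does that on a much smaller data set (the name small and its dimensions are assumptions chosen so it runs quickly). The two approaches only disagree for values exactly equal to -2 or 2, which are very unlikely with runif() output:

```r
set.seed(1)
# Small hypothetical data set, same distribution as the benchmark data
small <- as.data.frame(matrix(runif(50 * 4, min = -5, max = 5), ncol = 4))

factorize <- function(invec) {
  factorized <- rep("no_change", length(invec))
  factorized[invec > 2]  <- "up"
  factorized[invec < -2] <- "down"
  factor(factorized, c("down", "no_change", "up"))
}

# Subsetting approach
by_subset <- small
by_subset[] <- lapply(small, factorize)

# cut() approach
by_cut <- small
by_cut[] <- lapply(small, cut, c(-Inf, -2, 2, Inf), c("down", "no_change", "up"))

# Compare the labels column by column
same_labels <- mapply(function(a, b) all(as.character(a) == as.character(b)),
                      by_subset, by_cut)
all(same_labels)
```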

I'm generally not fond of ifelse(), so I'd probably introduce a new vector and treat the problem differently.

factorized <- rep("no_change", length(mock_data))
factorized[mock_data > 2]  <- "up"
factorized[mock_data < -2]  <- "down"
factorized <- as.factor(factorized)
#> factorized
#[1] no_change no_change no_change down      up        up        down      down      up        no_change no_change up        no_change no_change up       
#Levels: down no_change up

The data in this example is taken from the OP:

mock_data <- c(1.11004611710086, -1.86842617811635, 1.72159335808828, -2.68788822228089, 
           2.72551498375833, 3.67290901951492, -4.00984475389123, -2.39582793787122, 
           4.22395745059475, -0.360892189200968, 1.35027756914496, 2.89919016882777, 
           -0.158692332915962, -0.950306688901037, 3.39141107397154)
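
A small caveat on the as.factor() step: it derives the levels from the values that actually occur and sorts them alphabetically. That happens to give down, no_change, up for this data, but a level disappears if no value falls into it. The sketch below illustrates this with a made-up input (x is an assumption for illustration) and shows that passing the levels to factor() explicitly keeps all three:

```r
# Hypothetical input with no value above 2
x <- c(-3, 0, 1)

labelled <- rep("no_change", length(x))
labelled[x > 2]  <- "up"
labelled[x < -2] <- "down"

# as.factor() only knows the labels that occur
implicit <- as.factor(labelled)
levels(implicit)   # "down" "no_change"

# Explicit levels keep "up" even though it is unused here
explicit <- factor(labelled, levels = c("down", "no_change", "up"))
levels(explicit)   # "down" "no_change" "up"
```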

Thanks to @docendo discimus for improving this answer with a helpful comment.

Use apply with MARGIN = c(1, 2), which addresses both rows and columns.

apply(yourDF, c(1, 2), classifier)

This applies a function to every cell of a data.frame. Note that apply coerces the data.frame to a matrix and returns a matrix. It will not work on plain vectors, which have no dimensions.
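
A minimal sketch of the cell-wise apply call on a small made-up data.frame. Two things here are assumptions for illustration: the data in df, and the helper classify_one, a variant of the question's classifier that returns plain character strings instead of single-level factors so the result is easy to inspect. The last two lines turn the resulting character matrix back into a data.frame of factors:

```r
# Hypothetical scalar classifier returning strings (not the OP's exact function)
classify_one <- function(x) {
  if (x > 2) "up" else if (x < -2) "down" else "no_change"
}

# Hypothetical toy data
df <- data.frame(a = c(-3, 0, 3), b = c(2.5, -2.5, 1))

# MARGIN = c(1, 2) calls classify_one once per cell
res <- apply(df, c(1, 2), classify_one)
is.matrix(res)   # TRUE: apply returns a matrix, not a data.frame

# Convert back to a data.frame with factor columns sharing the same levels
out <- as.data.frame(res)
out[] <- lapply(out, factor, levels = c("down", "no_change", "up"))
out
```

Because this calls the function once per cell, it is likely to be much slower than the vectorised approaches on large data.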
