
Use sapply on custom functions in R

(Using mtcars & iris for reproducibility)

I have created an R function, get_col_info, that summarises a column of a data frame as follows:

  1. If the column is numeric/integer/double, get the min, max and mean.

  2. If the column is character/factor, get the count of unique values and the unique values themselves.

get_col_info <- function(data,col_name) {
  c_name <- c(col_name)
  s <- data[,c_name]
  type <- typeof(s)
  if(type %in% c("numeric","double","integer")){
    min <- min(s)
    max <- max(s)
    mean <- mean(s)
    aa <- list(min=min, max=max,mean=mean)
    return(aa)
  }
  if(type %in% c("character","factor")){
    uni <- unique(s)
    len <- length(uni)
    aa <- list(n_values=len,unique_values=c(uni))
    return(aa)
  }
}

get_col_info(mtcars, "mpg")
get_col_info(iris, "Petal.Width")
get_col_info(iris, "Species")

The first two calls run perfectly; the third gives an error, and I am not sure why.

However, my main question is how to run this function for all columns at once, something like sapply(iris,mean). I am not sure how to do that because the function takes a data frame and a column name. I tried this, but it gives an error:

sapply(iris,get_col_info(iris,names(iris)))

Error in match.fun(FUN) : 
  'get_col_info(iris, names(iris))' is not a function, character or symbol

Both apply and purrr solutions are welcome. I would also like to know how I could have written my function better; I suspect the c_name I created is not the ideal way to capture column names.

You should use class to check the type, not typeof. For a factor such as iris$Species, typeof returns "integer" because factors are stored as integer codes, so the original function takes the numeric branch and min fails (min is not meaningful for factors). With class the function becomes:

get_col_info <- function(data,col_name) {    
  s <- data[,col_name]
  type <- class(s)
  if(type %in% c("numeric","double","integer")){
    min <- min(s)
    max <- max(s)
    mean <- mean(s)
    aa <- list(min=min, max=max,mean=mean)
    return(aa)
  }
  else if(type %in% c("character","factor")){
    uni <- as.character(unique(s))
    len <- length(uni)
    aa <- list(n_values=len,unique_values=uni)
    return(aa)
  }
}

Checking the output:

get_col_info(mtcars, "mpg")
#$min
#[1] 10.4

#$max
#[1] 33.9

#$mean
#[1] 20.09062

get_col_info(iris, "Species")
#$n_values
#[1] 3

#$unique_values
#[1] "setosa"     "versicolor" "virginica" 

To run this for multiple columns you can use:

sapply(names(iris), get_col_info, data = iris)

Or replace sapply with map if you are interested in a purrr solution.
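
To make the mechanics of that call explicit, here is a minimal sketch. It repeats the class-based get_col_info so that it runs on its own; the names res_anon, res_named and res_purrr are only illustrative, and the purrr part is skipped if purrr is not installed.

```r
# Class-based get_col_info, repeated (slightly condensed) so the sketch is self-contained.
get_col_info <- function(data, col_name) {
  s <- data[, col_name]
  type <- class(s)
  if (type %in% c("numeric", "double", "integer")) {
    list(min = min(s), max = max(s), mean = mean(s))
  } else if (type %in% c("character", "factor")) {
    uni <- as.character(unique(s))
    list(n_values = length(uni), unique_values = uni)
  }
}

# sapply() needs a function as its second argument. The attempt in the question
# passed the *result* of calling get_col_info(), which is why match.fun() complained.
# An anonymous function makes the per-column call explicit:
res_anon <- lapply(names(iris), function(nm) get_col_info(iris, nm))
names(res_anon) <- names(iris)

# Shorter form: `data` is matched by name, so each column name fills `col_name`.
# The per-column results differ in length, so sapply() cannot simplify them
# and returns a named list.
res_named <- sapply(names(iris), get_col_info, data = iris)

# purrr version, run only if purrr is available. set_names() names the input
# so that the result is named by column as well.
if (requireNamespace("purrr", quietly = TRUE)) {
  res_purrr <- purrr::map(purrr::set_names(names(iris)), get_col_info, data = iris)
}

str(res_named$Species)
```

If you always want a list back regardless of the columns involved, lapply (or purrr::map) is the more predictable choice, since sapply changes its return shape depending on the results.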


Another way would be to pass the column values directly instead of the name.

get_col_info <- function(s) {    
  if(is.numeric(s)) {
    min <- min(s)
    max <- max(s)
    mean <- mean(s)
    aa <- list(min=min, max=max,mean=mean)
    return(aa)
  }
  else {
    uni <- as.character(unique(s))
    len <- length(uni)
    aa <- list(n_values=len,unique_values=uni)
    return(aa)
  }
}

sapply(iris, get_col_info)
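
Because the numeric and non-numeric branches return lists of different lengths, sapply hands back a plain list here. If a fixed-shape result is preferred, one option is to treat the two kinds of column separately. The following is only a sketch of that idea; num_summary, is_num, num_tbl and cat_vals are hypothetical names that do not appear in the answer above.

```r
# Hypothetical helper: numeric summary as a named numeric vector
num_summary <- function(s) c(min = min(s), max = max(s), mean = mean(s))

# Flag the numeric columns; vapply() guarantees one logical per column
is_num <- vapply(iris, is.numeric, logical(1))

# For numeric columns every result has length 3, so vapply() can build
# a numeric matrix with one column per variable
num_tbl <- vapply(iris[is_num], num_summary, c(min = 0, max = 0, mean = 0))

# For the remaining columns keep a list, since the number of
# unique values can differ from column to column
cat_vals <- lapply(iris[!is_num], function(s) as.character(unique(s)))

num_tbl
cat_vals
```

vapply also raises an error if a column ever produces a result of the wrong type or length, which makes this variant easier to use inside other functions.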

You can do this using summarise and across, with type checking (like is.numeric):

library(dplyr)

iris %>%
  summarise(across(where(is.numeric), list(min=min, max=max, mean=mean)),
            across(where(~is.factor(.) | is.character(.)), 
                   list(n_values = ~length(unique(.)), 
                        unique_values = ~as.character(unique(.))))) %>%
  glimpse()

Output:

Rows: 3
Columns: 14
$ Sepal.Length_min      <dbl> 4.3, 4.3, 4.3
$ Sepal.Length_max      <dbl> 7.9, 7.9, 7.9
$ Sepal.Length_mean     <dbl> 5.843333, 5.843333, 5.843333
$ Sepal.Width_min       <dbl> 2, 2, 2
$ Sepal.Width_max       <dbl> 4.4, 4.4, 4.4
$ Sepal.Width_mean      <dbl> 3.057333, 3.057333, 3.057333
$ Petal.Length_min      <dbl> 1, 1, 1
$ Petal.Length_max      <dbl> 6.9, 6.9, 6.9
$ Petal.Length_mean     <dbl> 3.758, 3.758, 3.758
$ Petal.Width_min       <dbl> 0.1, 0.1, 0.1
$ Petal.Width_max       <dbl> 2.5, 2.5, 2.5
$ Petal.Width_mean      <dbl> 1.199333, 1.199333, 1.199333
$ Species_n_values      <int> 3, 3, 3
$ Species_unique_values <chr> "setosa", "versicolor", "virginica"

Note: I added glimpse() to make the output more readable; it is not necessary.
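
The output above has three rows because unique_values returns three values for Species, so the single-value summaries are repeated to match. As far as I know, dplyr 1.1.0 and later warn when summarise returns more than one row per group and suggest reframe for that case. If one row is what you want, a possible variation (my assumption, not part of the answer above) is to collapse the unique values into a single string; the name one_row is only illustrative.

```r
library(dplyr)

# Sketch: same summary, but each column contributes exactly one value,
# so summarise() returns a single row.
one_row <- iris %>%
  summarise(
    across(where(is.numeric), list(min = min, max = max, mean = mean)),
    across(
      where(~ is.factor(.x) || is.character(.x)),
      list(
        n_values = ~ length(unique(.x)),
        # collapse the unique values into one comma-separated string
        unique_values = ~ paste(as.character(unique(.x)), collapse = ", ")
      )
    )
  )

glimpse(one_row)
```

The trade-off is that the unique values are no longer separate elements; keep the multi-row form (or a list-column) if you need to work with them individually.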
