
R custom function to apply to all variables in a dataframe

I am trying to write a custom function that, applied in a loop, gives me a table with all the information I need for every variable in my data frame. My function is based on dplyr functions and base R.

myfun <- function(x, y) summarise(x, var=names(x[y]), n=sum(!is.na(y)), blank=n()-sum(!is.na(y)), distinct=n_distinct(y, na.rm=TRUE))

My problem is that the base function ( names() ) requires the y argument (the variable name) to be given in quotation marks, while the dplyr function n_distinct needs it without quotation marks to give the right answer with na.rm=TRUE (if I use n_distinct(x[y], na.rm=TRUE) , the result does not exclude NA values). So I don't know how to give the y argument a form that works in both functions. I've tried using \" for the names() function, but it didn't seem to work. Here are the errors I get:

myfun <- function(x, y) summarise(x, var=names(x[y]), n=sum(!is.na(y)), blank=n()-sum(!is.na(y)), distinct=n_distinct(y, na.rm=TRUE))
myfun(mtcars, "cyl")

Error: Error in summarise_impl(.data, dots) : variable 'y' not found

myfun <- function(x, y) summarise(x, var=names(x[y]), n=sum(!is.na(y)), blank=n()-sum(!is.na(y)), distinct=n_distinct(y, na.rm=TRUE))
myfun(mtcars, cyl)

Error: Error in summarise_impl(.data, dots) : Evaluation error: object 'cyl' not found.

myfun <- function(x, y) summarise(x, var=names(x[y]), n=sum(!is.na(x[y])), blank=n()-sum(!is.na(x[y])), distinct=n_distinct(x[y], na.rm=TRUE))
myfun(mtcars, "cyl")

No error, but na.rm=TRUE seems to be ignored.
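One way to sidestep the quoting problem is sketched below in base R rather than dplyr: pass the column name as a string and extract the column as a plain vector with `[[`. The helper name `summarise_column` is hypothetical, and the sketch assumes the NA handling goes wrong because `x[y]` is a one-column data frame rather than a vector; that has not been checked against the dplyr version used in the question.

```r
# Hypothetical helper: `x` is a data frame, `y` is a column name given as a string.
summarise_column <- function(x, y) {
  col <- x[[y]]  # [[ ]] extracts the column as a plain vector
  data.frame(
    var = y,
    n = sum(!is.na(col)),                         # non-missing values
    blank = sum(is.na(col)),                      # missing values
    distinct = length(unique(col[!is.na(col)])),  # NA removed before counting
    stringsAsFactors = FALSE
  )
}

class(mtcars["cyl"])    # "data.frame": single brackets keep a one-column data frame
class(mtcars[["cyl"]])  # "numeric": double brackets return the vector itself

# airquality is a built-in dataset that contains missing values
res_ozone <- summarise_column(airquality, "Ozone")
res_cyl <- summarise_column(mtcars, "cyl")
```

Because the name stays a string throughout, the same `y` serves both as the label in `var` and as the lookup key for the column.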

My goal is then to apply it in a loop to build a table with one row per variable of my data frame, which I could then export so that I have this information for all the variables in a single table.

I tried to make a minimal reproducible example:

library(dplyr)
myfun <- function(x, y) summarise(x, var=names(x[, y]), n=sum(!is.na(x[, y])), blank=n()-sum(!is.na(x[, y])), n_distinct=n_distinct(x[, y], na.rm=TRUE))
a <- mtcars%>%
  summarise(n=sum(!is.na(cyl)), blank=n()-sum(!is.na(cyl)), n_distinct=n_distinct(cyl, na.rm=TRUE))
a <- lapply(colnames(mtcars), function(x) data.frame(bind_rows(a, myfun(mtcars, x))))
a <- data.frame(bind_rows(a, myfun(mtcars, "cyl")))
a <- a%>%
  filter(!is.na(var))%>%
  distinct(var, .keep_all=TRUE)

But for some reason I cannot understand, it doesn't work (line a <- lapply(colnames(mtcars), function(x) data.frame(bind_rows(a, myfun(mtcars, x)))) , error message Error in summarise_impl(.data, dots) : Column var is of unsupported type NULL ). It works fine with my own data frame. I subsetted it and it still worked. I then recreated the same data by hand, typing in the same values with the same classes, and it didn't work. So I'm really lost: I don't understand why it works for my dataset but for no other. I'm new to R and am learning by trial and error, without any formal training in the language, so I sometimes have no idea what I'm really doing when something works (like the code above does for me) and then stops working.
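A possible explanation for the `NULL` error, offered as an assumption since the asker's own data is not shown: on a plain data.frame such as mtcars, `x[, y]` with a single column is dropped to an unnamed vector, so `names(x[, y])` returns `NULL`. If the asker's data was a tibble, `x[, y]` would stay a one-column table and `names()` would return the column name, which would fit the "works on my data only" symptom. The base R sketch below shows the two behaviours using `drop = FALSE` to mimic the non-dropping case.

```r
# Single-column extraction from a plain data.frame drops to a vector by default
col_vec <- mtcars[, "cyl"]

# drop = FALSE keeps a one-column data frame instead
col_df <- mtcars[, "cyl", drop = FALSE]

names(col_vec)  # NULL: an unnamed numeric vector has no names
names(col_df)   # "cyl": the data frame still carries its column name
```

Using the string `y` itself as the label, instead of `names(x[, y])`, avoids depending on either behaviour.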

So this code works pretty well for me. The only problem, as I said, is that because I use n_distinct(x[, y]) it ignores na.rm=TRUE , which I cannot understand.

Sorry for the rather unclear question. I would be glad to edit it if you leave a comment about how to clarify it. I'm simply lost with my attempts and have no idea how to present things more clearly. Thanks for the help and sorry for the mess.

I'm not entirely clear on exactly what you are trying to do, but this might get at it.

First create a function that will be run for each column.

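fn <- function(x) {
    # x is a single column, passed in as a plain vector
    n <- sum(!is.na(x))                    # non-missing values
    blank <- sum(is.na(x))                 # missing values
    dist <- length(unique(x[!is.na(x)]))   # distinct values, NA excluded
    c(n = n, blank = blank, distinct = dist)
}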

Then use apply to apply the function to each column of the data.frame. I've transposed the result so that each column of the data frame becomes a row.

t(apply(mtcars, 2, fn))
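A variation on the answer above, as a rough sketch: `apply()` first converts the data frame to a matrix, which forces all columns to a common type when they are mixed. Iterating over the columns with `vapply()` keeps each column's own type and lets the variable name sit in an ordinary column, ready for export. The dataset (`airquality`, chosen because it contains NA values) and the names `df` and `summary_table` are illustrative assumptions, not part of the original answer.

```r
df <- airquality  # built-in dataset with missing values in Ozone and Solar.R

# One row per variable: name, non-missing count, missing count, distinct non-missing values
summary_table <- data.frame(
  var = names(df),
  n = vapply(df, function(col) sum(!is.na(col)), integer(1)),
  blank = vapply(df, function(col) sum(is.na(col)), integer(1)),
  distinct = vapply(df, function(col) length(unique(col[!is.na(col)])), integer(1)),
  row.names = NULL,
  stringsAsFactors = FALSE
)

summary_table
```

The result is a regular data frame, so it can be passed straight to `write.csv()` for export.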
