Function in R for validating existence of specific columns on a data.frame

I'd like to validate that a data.frame contains columns with specific names. Ideally this would be a utility function to which I can pass the data.frame and the expected column names, and which raises an error if the data.frame does not contain the expected columns. I have written my own function below; however, this seems like something that would already exist in the R ecosystem.

My questions are:

  1. Does such a function (or one-liner) already exist, either in base R or in a common package?
  2. If not, any suggestions for my function (below)?

Example of the function I have written to do this:

validate_df_columns <- function(df, columns) {
    chr_df <- deparse(substitute(df))
    chr_columns <- paste(columns, collapse = ", ")
    if (!('data.frame' %in% class(df))) {
        stop(paste("Argument", chr_df, "must be a data.frame."))
    }
    if (sum(colnames(df) %in% columns) != length(columns)) {
        stop(paste(chr_df, "must contain the columns", chr_columns))
    }
}

validate_df_columns(data.frame(a=1:3, b=4:6), c("a", "b", "c'"))
## Error in validate_df_columns(data.frame(a = 1:3, b = 4:6), c("a", "b",  : 
##   data.frame(a = 1:3, b = 4:6) must contain the columns a, b, c'

The packages tibble and rlang, part of the tidyverse, have a function to check this:

library(tibble) # or library(rlang) or library(tidyverse)
has_name(iris, c("Species","potatoe"))
# [1]  TRUE FALSE

Technically it lives in rlang, and its code is just:

function (x, name) 
{
    name %in% names2(x)
}

where rlang::names2 is an enhanced version of base::names that returns a vector of empty strings rather than NULL when the object doesn't have names.
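To illustrate the difference on an unnamed object, here is a small sketch. The variable names (`x`, `base_names`, `in_base`, `rlang_names`) are made up for this example, and the rlang call is guarded so the snippet also runs where rlang is not installed.

```r
x <- 1:3                          # a vector with no names attribute

base_names <- names(x)            # base R gives NULL here
is.null(base_names)

# %in% copes with NULL on its right-hand side, so the membership test still works
in_base <- "a" %in% base_names

# rlang::names2() gives one empty string per element instead of NULL.
# Guarded so the sketch does not fail when rlang is missing.
if (requireNamespace("rlang", quietly = TRUE)) {
  rlang_names <- rlang::names2(x)
  print(rlang_names)
}
```

For a plain membership check on a data.frame, either flavour of names should behave the same, since data.frames always carry column names.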

Here's a way to rewrite your function:

validate_df_columns <- function(df, columns) {
  if (!is.data.frame(df)) {
    stop(paste("Argument", deparse(substitute(df)), "must be a data.frame."))
  }
  if (!all(i <- rlang::has_name(df, columns)))
    stop(sprintf(
      "%s doesn't contain: %s",
      deparse(substitute(df)),
      paste(columns[!i], collapse = ", ")))
}

validate_df_columns(iris, c("Species","potatoe","banana"))
# Error in validate_df_columns(iris, c("Species", "potatoe", "banana")) : 
# iris doesn't contain: potatoe, banana

Using deparse(substitute(...)) here makes little sense to me though, since the function isn't meant to be used interactively; in my opinion it is clearer to just say "df".
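For anyone who prefers to avoid the rlang dependency, a base R variant along the same lines might look like the sketch below. The name `validate_df_columns_base` is hypothetical, the fixed "df" label follows the suggestion above, and `call. = FALSE` is an extra choice made here to keep the error message short.

```r
# Hypothetical base R variant: no rlang, and a fixed `df` label in the messages.
validate_df_columns_base <- function(df, columns) {
  if (!is.data.frame(df)) {
    stop("`df` must be a data.frame.", call. = FALSE)
  }
  # setdiff() keeps the requested names that are not columns of df
  missing_cols <- setdiff(columns, names(df))
  if (length(missing_cols) > 0) {
    stop("`df` doesn't contain: ", paste(missing_cols, collapse = ", "),
         call. = FALSE)
  }
  invisible(df)  # return the input invisibly so the check can sit in a pipeline
}

# Passes silently when every column is present
validate_df_columns_base(iris, c("Species", "Sepal.Length"))

# Capture the error message instead of stopping the script
msg <- tryCatch(
  validate_df_columns_base(iris, c("Species", "potatoe", "banana")),
  error = function(e) conditionMessage(e)
)
msg
```

Returning the data frame invisibly is only one possible design; returning nothing, as in the versions above, works just as well if the function is called purely for its side effect.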

The %in% operator works with pairs of vectors, so there is already a one-liner we can use here. Consider:

df <- data.frame(a=c(1:3), b=c(4:6), c=c(7:9))
names <- c("a", "c", "blah", "doh")
names[names %in% names(df)]

[1] "a" "c"

If you want to assert that the data frame contains all the input names, then just use:

all(names %in% names(df))     # TRUE only if every input name is a column of df
setequal(names, names(df))    # TRUE only if the inputs match df's columns exactly
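As a worked check, the sketch below walks through what `%in%` returns for the example data. The vector is called `wanted` here (a name chosen for this example) so that it does not shadow the `names()` function. Note that `%in%` yields one logical value per element of its left operand, so it is the values, not the length of the result, that say whether the names are present.

```r
df <- data.frame(a = 1:3, b = 4:6, c = 7:9)
wanted <- c("a", "c", "blah", "doh")

present <- wanted %in% names(df)   # one logical per element of `wanted`
n_flags <- length(present)         # equals length(wanted) whatever df contains
all_present <- all(present)        # TRUE only if every wanted name is a column
missing_names <- wanted[!present]  # the names that are not columns of df

# Exact match between a set of names and the columns, ignoring order
exact_match <- setequal(c("c", "a", "b"), names(df))

print(present)
print(missing_names)
```

Wrapping the check as `stopifnot(all(wanted %in% names(df)))` gives a base R one-liner that raises an error when a column is missing, although its message will not list which names were absent.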
