Function in R for validating existence of specific columns on a data.frame
I'd like to validate that a data.frame contains columns with specific names. Ideally this would be a utility function to which I can pass the data.frame and the expected column names, and which raises an error if the data.frame does not contain the expected columns. I have written my own function below; however, this seems like something that would already exist in the R ecosystem.
My questions are:
Example of the function I have written to do this:
validate_df_columns <- function(df, columns) {
  # Capture the caller's expression for df so it can appear in error messages.
  chr_df <- deparse(substitute(df))
  chr_columns <- paste(columns, collapse = ", ")
  if (!('data.frame' %in% class(df))) {
    stop(paste("Argument", chr_df, "must be a data.frame."))
  }
  # Every expected name must be among the column names of df.
  if (!all(columns %in% colnames(df))) {
    stop(paste(chr_df, "must contain the columns", chr_columns))
  }
}
validate_df_columns(data.frame(a=1:3, b=4:6), c("a", "b", "c'"))
## Error in validate_df_columns(data.frame(a = 1:3, b = 4:6), c("a", "b", :
## data.frame(a = 1:3, b = 4:6) must contain the columns a, b, c'
The packages tibble and rlang, part of the tidyverse, have a function to check this:
library(tibble) # or library(rlang) or library(tidyverse)
has_name(iris, c("Species","potatoe"))
# [1] TRUE FALSE
Technically it lives in rlang and its code is just:
function (x, name)
{
name %in% names2(x)
}
where rlang::names2 is an enhanced version of base::names which returns a vector of empty strings rather than NULL when the object doesn't have names.
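To see that behaviour without installing rlang, here is a rough base-R approximation. `has_name_base` is a hypothetical helper name, not part of any package. It only mimics the two things names2 is documented to do here: return empty strings when there are no names, and treat missing names as empty.

```r
# Hypothetical base-R stand-in for rlang::has_name(); an approximation only.
has_name_base <- function(x, name) {
  nms <- names(x)
  # Mimic names2(): use empty strings when the object has no names at all.
  if (is.null(nms)) nms <- rep("", length(x))
  # names2() also standardises NA names to ""; do the same here.
  nms[is.na(nms)] <- ""
  name %in% nms
}

has_name_base(iris, c("Species", "potatoe"))
# [1]  TRUE FALSE
has_name_base(1:3, "a")   # an unnamed vector has no matching name
# [1] FALSE
```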
Here's a way to rewrite your function:
validate_df_columns <- function(df, columns) {
  if (!is.data.frame(df)) {
    stop(paste("Argument", deparse(substitute(df)), "must be a data.frame."))
  }
  # i is TRUE for each expected column that is present in df
  if (!all(i <- rlang::has_name(df, columns)))
    stop(sprintf(
      "%s doesn't contain: %s",
      deparse(substitute(df)),
      paste(columns[!i], collapse = ", ")))
}
validate_df_columns(iris, c("Species","potatoe","banana"))
# Error in validate_df_columns(iris, c("Species", "potatoe", "banana")) :
# iris doesn't contain: potatoe, banana
Using deparse(substitute(...)) here makes little sense to me though, as the function is not used interactively; in my opinion it is clearer to just say "df".
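A minimal sketch of that simpler variant, assuming base R only and a fixed "df" label in the messages. The name `check_columns`, the `call. = FALSE` choice and the invisible return value are illustrative assumptions, not part of the answer above.

```r
# Hypothetical variant: base R only, fixed "df" label in the messages.
check_columns <- function(df, columns) {
  if (!is.data.frame(df)) {
    stop("df must be a data.frame.", call. = FALSE)
  }
  # setdiff() keeps the expected names that are absent from df.
  missing_cols <- setdiff(columns, names(df))
  if (length(missing_cols) > 0) {
    stop("df doesn't contain: ", paste(missing_cols, collapse = ", "),
         call. = FALSE)
  }
  invisible(df)  # return the input so the check can be chained
}

check_columns(iris, c("Species", "Sepal.Length"))  # passes silently

# Capture the error message instead of aborting, to inspect it.
msg <- tryCatch(
  check_columns(iris, c("Species", "potatoe", "banana")),
  error = function(e) conditionMessage(e)
)
msg
# [1] "df doesn't contain: potatoe, banana"
```

Returning the data frame invisibly is a design choice that lets the check sit at the top of a function or inside a pipeline without printing anything.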
The %in% operator works with pairs of vectors, so there is already a one-liner we can use here. Consider:
df <- data.frame(a=c(1:3), b=c(4:6), c=c(7:9))
names <- c("a", "c", "blah", "doh")
names[names %in% names(df)]
[1] "a" "c"
If you want to assert that the data frame contains all the input names, then just use:
all(names %in% names(df)) # TRUE only if all inputs are present
setequal(names, names(df)) # TRUE only if the inputs match the columns of df exactly
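Because `%in%` returns one logical value per element of its left-hand side, the length of its result always equals the length of the input, so a length comparison cannot reveal a missing name. The sketch below walks through this on the same data frame. The variable name `wanted` is an illustrative assumption, chosen to avoid shadowing the base function `names`.

```r
df <- data.frame(a = 1:3, b = 4:6, c = 7:9)
wanted <- c("a", "c", "blah", "doh")  # hypothetical expected column names

present <- wanted %in% names(df)      # one TRUE/FALSE per entry of `wanted`
present
# [1]  TRUE  TRUE FALSE FALSE

length(present) == length(wanted)     # always TRUE, so not a useful check
# [1] TRUE

all(present)                          # FALSE here: some inputs are missing
# [1] FALSE

setdiff(wanted, names(df))            # which inputs are missing
# [1] "blah" "doh"

setequal(c("c", "b", "a"), names(df)) # same set of names, order ignored
# [1] TRUE
```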