How to summarise across different types of variables with dplyr::c_across()
I have data with different types of variables. Some are character, some factors, and some numeric, like below:
df <- data.frame(a = c("tt", "ss", "ss", NA), b=c(2,3,NA,1), c=c(1,2,NA, NA), d=c("tt", "ss", "ss", NA))
I'm trying to count the number of missing values per observation using c_across in dplyr. However, c_across doesn't seem to be able to combine different types of values, as the error message below suggests:
library(dplyr)
df %>%
  rowwise() %>%
  summarise(NAs = sum(is.na(c_across())))
Error: Problem with `summarise()` input `NAs`.
x Can't combine `a` <factor> and `b` <double>.
ℹ Input `NAs` is `sum(is.na(c_across()))`.
ℹ The error occurred in row 1.
Indeed, if I include only numeric variables, it works:
df %>%
rowwise() %>%
summarise(NAs = sum(is.na(c_across(b:c))))
The same is true if I include only character variables:
df %>%
rowwise() %>%
summarise(NAs = sum(is.na(c_across(c(a,d)))))
I could solve the issue without using c_across, like below, but I have lots of variables, so it's not very practical.
df %>%
rowwise() %>%
summarise(NAs = is.na(a)+is.na(b)+is.na(c)+is.na(d))
I could use the traditional apply approach, like below, but I'd like to solve this using dplyr.
apply(df, 1, function(x)sum(is.na(x)))
Any suggestions as to how to compute the number of missing values, row-wise, efficiently, and using dplyr?
I would suggest this approach. The issue comes from two things. First, the variables in your data frame have different types, and c_across() cannot combine them. Second, it helps to have a key variable for the row-wise task. So, in the following code we first convert the variables to a common type, then create an id based on the row number. We pass that id to rowwise() and can then use the c_across() function. Here is the code (I have used your df data):
library(tidyverse)
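#Code
# Convert every column to character, add a row id, then count NAs per row.
# funs() is deprecated since dplyr 0.8.0, so the function is passed directly.
df %>%
  mutate_at(vars(everything()), as.character) %>%
  mutate(id = 1:n()) %>%
  rowwise(id) %>%
  mutate(NAs = sum(is.na(c_across(a:d))))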
Output:
# A tibble: 4 x 6
# Rowwise:  id
  a     b     c     d        id   NAs
  <chr> <chr> <chr> <chr> <int> <int>
1 tt    2     1     tt        1     0
2 ss    3     2     ss        2     0
3 ss    NA    NA    ss        3     2
4 NA    1     NA    NA        4     3
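As a rough illustration of why the conversion step matters: per the dplyr documentation, c_across() combines values with vctrs::vec_c(), which is stricter than base c(). The sketch below assumes the vctrs package is installed (it is a dependency of dplyr); the variable names are made up for the example.

```r
library(vctrs)

# Base c() silently coerces mixed types to character
base_result <- c("tt", 2)

# vec_c() refuses to combine character with double and raises an error,
# which is caught here and replaced by a short marker string
strict_result <- tryCatch(
  vec_c("tt", 2),
  error = function(e) "incompatible types"
)

# After converting to a common type, combining works
converted <- vec_c("tt", as.character(2))
```

This is why converting every column to character before calling c_across() makes the error go away.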
We can also avoid the mutate_at() function by using the new across() with mutate() to convert the variables to a common type:
#Code 2
df %>%
mutate(across(a:d,~as.character(.))) %>%
mutate(id=1:n()) %>%
rowwise(id) %>%
mutate(NAs = sum(is.na(c_across(a:d))))
Output:
# A tibble: 4 x 6
# Rowwise:  id
  a     b     c     d        id   NAs
  <chr> <chr> <chr> <chr> <int> <int>
1 tt    2     1     tt        1     0
2 ss    3     2     ss        2     0
3 ss    NA    NA    ss        3     2
4 NA    1     NA    NA        4     3
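A possible variation, offered as a sketch rather than a correction: the columns passed to rowwise() are the ones preserved by summarise(), so with mutate() the id column can be left out. The example below assumes dplyr 1.0.0 or later; the name na_counts is only for illustration.

```r
library(dplyr)

df <- data.frame(a = c("tt", "ss", "ss", NA), b = c(2, 3, NA, 1),
                 c = c(1, 2, NA, NA), d = c("tt", "ss", "ss", NA))

na_counts <- df %>%
  mutate(across(everything(), as.character)) %>%  # common type for c_across()
  rowwise() %>%                                    # no id column supplied
  mutate(NAs = sum(is.na(c_across(a:d)))) %>%      # count NAs within each row
  ungroup()                                        # drop the rowwise grouping
```

Calling ungroup() at the end returns an ordinary tibble, so later operations are no longer applied row by row.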
A much faster option is not to use rowwise or c_across, but rowSums:
library(dplyr)
df %>%
mutate(NAs = rowSums(is.na(.)))
# a b c d NAs
#1 tt 2 1 tt 0
#2 ss 3 2 ss 0
#3 ss NA NA ss 2
#4 <NA> 1 NA <NA> 3
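The same idea can be written without the magrittr `.` placeholder, which may be convenient if the pipeline is later moved to the native pipe. This is a minimal sketch assuming dplyr 1.0.0 or later: across() applies is.na() to each column and returns a data frame of logicals, which rowSums() then adds up per row.

```r
library(dplyr)

df <- data.frame(a = c("tt", "ss", "ss", NA), b = c(2, 3, NA, 1),
                 c = c(1, 2, NA, NA), d = c("tt", "ss", "ss", NA))

# is.na() is applied column by column, so mixed column types are not a problem
na_counts <- df %>%
  mutate(NAs = rowSums(across(everything(), is.na)))
```

Because no values from different columns are ever combined into one vector, no type conversion is needed here.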
If we want to select certain columns, e.g. only the numeric ones:
df %>%
mutate(NAs = rowSums(is.na(select(., where(is.numeric)))))
# a b c d NAs
#1 tt 2 1 tt 0
#2 ss 3 2 ss 0
#3 ss NA NA ss 2
#4 <NA> 1 NA <NA> 1
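If this count is needed in several places, the two rowSums() variants could be wrapped in a small helper. The function count_na() and its cols argument are hypothetical names introduced for this sketch; they are not part of dplyr.

```r
library(dplyr)

# Hypothetical helper: adds an NAs column counting missing values per row
# across the columns matched by a tidyselect expression (default: all columns)
count_na <- function(data, cols = everything()) {
  selected <- select(data, {{ cols }})
  mutate(data, NAs = rowSums(is.na(selected)))
}

df <- data.frame(a = c("tt", "ss", "ss", NA), b = c(2, 3, NA, 1),
                 c = c(1, 2, NA, NA), d = c("tt", "ss", "ss", NA))

all_cols <- count_na(df)                     # every column
num_cols <- count_na(df, where(is.numeric))  # numeric columns only
```

The selection is done on a copy of the data, so all original columns are kept in the result regardless of which ones are counted.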