Count number of observations without NA per year in R

I have a dataset and I want to count, for each year, the number of observations of each variable that are not missing (missing values are denoted by NA).

My data looks like the following:

data <- read.table(header = TRUE, 
               stringsAsFactors = FALSE, 
               text="CompanyNumber ResponseVariable Year ExplanatoryVariable1 ExplanatoryVariable2
               1 2.5 2000 1 2
               1 4 2001 3 1
               1 3 2002 NA 7
               2 1 2000 3 NA
               2 2.4 2001 0 4
               2 6 2002 2 9
               3 10 2000 NA 3")

I was planning to use the dplyr package, but this only takes the years into account, not the different variables:

library(dplyr)
data %>% 
  group_by(Year) %>%
  summarise(number = n())

How can I obtain the following outcome?

                    2000 2001 2002
ExplanatoryVariable1  2   2    1 
ExplanatoryVariable2  2   2    2

To get the counts, you can start by using:

library(dplyr)
data %>% 
  group_by(Year) %>% 
  summarise_at(vars(starts_with("Expla")), ~sum(!is.na(.)))
## A tibble: 3 x 3
#   Year ExplanatoryVariable1 ExplanatoryVariable2
#  <int>                <int>                <int>
#1  2000                    2                    2
#2  2001                    2                    2
#3  2002                    1                    2
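
For readers on dplyr 1.0.0 or later, where summarise_at() is superseded, the same count can be sketched with across(). This assumes the sample data from the question; the name counts is only an illustrative choice.

```r
library(dplyr)

# Sample data from the question
data <- read.table(header = TRUE, stringsAsFactors = FALSE,
  text = "CompanyNumber ResponseVariable Year ExplanatoryVariable1 ExplanatoryVariable2
          1 2.5 2000 1 2
          1 4 2001 3 1
          1 3 2002 NA 7
          2 1 2000 3 NA
          2 2.4 2001 0 4
          2 6 2002 2 9
          3 10 2000 NA 3")

# Count non-missing values per year for every column starting with "Expla"
counts <- data %>%
  group_by(Year) %>%
  summarise(across(starts_with("Expla"), ~ sum(!is.na(.x))))

counts
```

The lambda is the same as in the summarise_at() version; only the column selection moves inside across().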

If you want to reshape it as shown in your question, you can extend the pipe using tidyr functions:

library(tidyr)
data %>% 
  group_by(Year) %>% 
  summarise_at(vars(starts_with("Expla")), ~sum(!is.na(.))) %>% 
  gather(var, count, -Year) %>% 
  spread(Year, count)
## A tibble: 2 x 4
#                   var `2000` `2001` `2002`
#*                <chr>  <int>  <int>  <int>
#1 ExplanatoryVariable1      2      2      1
#2 ExplanatoryVariable2      2      2      2
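
In tidyr 1.0.0 and later, gather() and spread() are superseded by pivot_longer() and pivot_wider(). A possible equivalent of the reshaping step, assuming the sample data from the question and dplyr 1.0.0 or later for across(), might look like this (the name wide is illustrative):

```r
library(dplyr)
library(tidyr)

# Sample data from the question
data <- read.table(header = TRUE, stringsAsFactors = FALSE,
  text = "CompanyNumber ResponseVariable Year ExplanatoryVariable1 ExplanatoryVariable2
          1 2.5 2000 1 2
          1 4 2001 3 1
          1 3 2002 NA 7
          2 1 2000 3 NA
          2 2.4 2001 0 4
          2 6 2002 2 9
          3 10 2000 NA 3")

wide <- data %>%
  group_by(Year) %>%
  summarise(across(starts_with("Expla"), ~ sum(!is.na(.x)))) %>%
  # one row per (Year, variable) pair
  pivot_longer(-Year, names_to = "var", values_to = "count") %>%
  # one column per year
  pivot_wider(names_from = Year, values_from = count)

wide
```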

A note for the OP, since they have about 200 explanatory variables to select: summarise_at offers other ways to specify the variables. If the variables are adjacent in the data, you can simply give the range first:last, for example:

data %>% 
  group_by(Year) %>%
  summarise_at(vars(ExplanatoryVariable1:ExplanatoryVariable2), ~sum(!is.na(.))) 

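Or by position. Note that summarise_at counts positions among the non-grouping columns, so with Year grouped, 3:4 refers to ExplanatoryVariable1 and ExplanatoryVariable2: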

data %>% 
  group_by(Year) %>% 
  summarise_at(3:4, ~sum(!is.na(.))) 

Or store the variable names in a vector and use that:

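expl_vars <- names(data)[4:5]  # named expl_vars so it does not mask dplyr::vars()
data %>% 
  group_by(Year) %>% 
  summarise_at(expl_vars, ~sum(!is.na(.))) 

# Alternative: stack the explanatory variables into long format first
data %>%
  gather(cat, val, -(1:3)) %>%      # keep the first three columns as identifiers
  filter(complete.cases(.)) %>%     # drop rows whose value is NA
  group_by(Year, cat) %>%
  summarize(n = n()) %>%            # count remaining rows per year and variable
  spread(Year, n)                   # one column per year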

# # A tibble: 2 x 4
#                    cat `2000` `2001` `2002`
# *                <chr>  <int>  <int>  <int>
# 1 ExplanatoryVariable1      2      2      1
# 2 ExplanatoryVariable2      2      2      2

That should do it. You start by stacking the data into long format and then simply count the rows for each combination of year and explanatory variable. If you want the result back in wide format, use spread; even without spread, you get the counts for both variables.
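
One caveat with filtering before counting: a year/variable combination with no non-missing values at all disappears from the long result. A hedged variation that keeps such combinations with a count of 0 is to sum the non-missing indicator instead of filtering. This sketch assumes the sample data from the question and the newer pivot_longer() interface; long_counts, cat and val are illustrative names.

```r
library(dplyr)
library(tidyr)

# Sample data from the question
data <- read.table(header = TRUE, stringsAsFactors = FALSE,
  text = "CompanyNumber ResponseVariable Year ExplanatoryVariable1 ExplanatoryVariable2
          1 2.5 2000 1 2
          1 4 2001 3 1
          1 3 2002 NA 7
          2 1 2000 3 NA
          2 2.4 2001 0 4
          2 6 2002 2 9
          3 10 2000 NA 3")

long_counts <- data %>%
  # stack the explanatory variables: one row per observation and variable
  pivot_longer(starts_with("Expla"), names_to = "cat", values_to = "val") %>%
  group_by(Year, cat) %>%
  # NA rows stay in the data and simply contribute 0 to the sum
  summarise(n = sum(!is.na(val)), .groups = "drop")

long_counts
```

With the sample data every combination has at least one value, so the counts match the filtered version.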

Using base R:

do.call(cbind, by(data[3:5], data$Year, function(x) colSums(!is.na(x[-1]))))
#                      2000 2001 2002
# ExplanatoryVariable1    2    2    1
# ExplanatoryVariable2    2    2    2
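
A similar base R result can be sketched with split() and sapply(), which some readers may find easier to follow than by() with do.call(). This assumes the sample data from the question, with the explanatory variables in columns 4 and 5; counts is an illustrative name.

```r
# Sample data from the question
data <- read.table(header = TRUE, stringsAsFactors = FALSE,
  text = "CompanyNumber ResponseVariable Year ExplanatoryVariable1 ExplanatoryVariable2
          1 2.5 2000 1 2
          1 4 2001 3 1
          1 3 2002 NA 7
          2 1 2000 3 NA
          2 2.4 2001 0 4
          2 6 2002 2 9
          3 10 2000 NA 3")

# Split the explanatory columns by year, then count non-missing values per column;
# sapply() binds the per-year vectors into a variables-by-years matrix
counts <- sapply(split(data[4:5], data$Year), function(x) colSums(!is.na(x)))

counts
```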

With aggregate:

# na.action = na.pass keeps rows that contain NA so they can be counted
aggregate(. ~ Year, data[3:5], function(x) sum(!is.na(x)), na.action = na.pass)
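
The na.action argument matters here because the formula interface of aggregate defaults to na.omit, which removes every row containing an NA before the function is applied. The sketch below contrasts the two on the sample data from the question; with_default and with_pass are illustrative names.

```r
# Sample data from the question
data <- read.table(header = TRUE, stringsAsFactors = FALSE,
  text = "CompanyNumber ResponseVariable Year ExplanatoryVariable1 ExplanatoryVariable2
          1 2.5 2000 1 2
          1 4 2001 3 1
          1 3 2002 NA 7
          2 1 2000 3 NA
          2 2.4 2001 0 4
          2 6 2002 2 9
          3 10 2000 NA 3")

count_present <- function(x) sum(!is.na(x))

# Default na.action (na.omit): rows with any NA are dropped first,
# so both variables end up with the count of complete rows per year
with_default <- aggregate(. ~ Year, data[3:5], count_present)

# na.pass: all rows are kept, so each variable is counted on its own
with_pass <- aggregate(. ~ Year, data[3:5], count_present, na.action = na.pass)

with_default
with_pass
```

On this data the default gives 1, 2, 1 for both variables, while na.pass gives the intended 2, 2, 1 and 2, 2, 2.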

You could do it with aggregate in base R.

aggregate(list(ExplanatoryVariable1 = data$ExplanatoryVariable1,
               ExplanatoryVariable2 = data$ExplanatoryVariable2),
          list(Year = data$Year),
          function(x) length(x[!is.na(x)]))
#  Year ExplanatoryVariable1 ExplanatoryVariable2
#1 2000                    2                    2
#2 2001                    2                    2
#3 2002                    1                    2
