
How to count length of NA values by group/factor in R?

I am tasked with manipulating data obtained from 1258 unique surveys.

In terms of dimensions: 28 million individual observations (including NA) and 8 columns (variables). Object name: dat

The column/variable I am particularly interested in is education (edu). I want to get the length of NA and non-NA values (for edu) for those studies by aggregating (dat$edu ~ id_study).

[Screenshot: the first five entries of the 8 columns; I want to keep id_study]

So far, I have used this code to work out the number of studies which contain at least one entry on edu.

numbers <- aggregate(dat$edu ~ dat$id_study, data=dat, FUN=length)

[Screenshot: aggregate result]

I have the result I need for quantifying the number of unique id_study values that have data on edu. This ticks box one.

Now I need to do the same for the unique id_study values that have nothing at all on education. How do I do this?

I've tried many variations of code to work out the length of NAs for studies that do not have anything on edu.

aggregate_2 <- aggregate(dat$edu ~ id_study, data=dat, FUN=length(dat[!is.na(dat)]))

This does not work :(

Can anyone shed some light on this matter please?

Thank you

EDIT: Just to clarify, in case my question was not clear. There are 1258 unique surveys/studies (some surveys may cover multiple years, e.g. ALB_2013 and ALB_2014 under id_study).

Out of these surveys, using the first code snippet above, I worked out that 530 of these 1258 surveys provided >= 1 individual observation under the edu column.

This must mean 728 unique surveys did not provide any information at all under edu. I want to work out the names of those 728 surveys and, hopefully using a function, the length of NAs for each survey that didn't provide any information at all.

I hope this makes sense.

id_study (name of the survey), id (survey id); the column I'm interested in is "edu".

First off: Posting a screenshot of your data is bad practice, as it would require SO respondents to manually type in your sample data. Use dput to post (part of) your data. For future questions, follow the advice and links in Sotos' first comment!
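As a rough illustration of the dput advice, here is a minimal sketch. The toy data frame is made up and only stands in for the real dat (its values and the reduced set of two columns are assumptions, not the asker's data):

```r
# Hypothetical toy data standing in for the real `dat` (values are made up)
dat <- data.frame(
  id_study = c("ALB_2013", "ALB_2013", "ALB_2014"),
  edu      = c(2, NA, NA)
)

# dput() prints R code that recreates the object;
# paste that printed output into the question instead of a screenshot
dput(head(dat, 5))
```

Anyone answering can then assign the pasted structure(...) output to a variable and work with exactly the same rows.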

That aside, how about the following:

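numbers <- aggregate(
    edu ~ id_study, 
    data = dat, 
    # count non-NA and NA values of edu within each study
    FUN = function(x) c(n_nonNA = sum(!is.na(x)), n_NA = sum(is.na(x))),
    # the formula method drops NA rows by default (na.action = na.omit); keep them
    na.action = na.pass)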
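
For the second part of the question (the names of the studies with no edu data at all, plus their NA counts), one possible approach in base R is sketched below. It uses tapply rather than the formula interface, so rows with a missing edu are counted rather than dropped. The toy data and the names counts and no_edu are hypothetical and only for illustration:

```r
# Hypothetical toy data standing in for `dat` (values are made up)
dat <- data.frame(
  id_study = c("ALB_2013", "ALB_2013", "ALB_2014", "ALB_2014", "BRA_2010"),
  edu      = c(1, NA, NA, NA, 3)
)

# Per-study counts of missing and non-missing edu values;
# tapply keeps every study, including those where edu is entirely NA
n_NA    <- tapply(is.na(dat$edu), dat$id_study, sum)
n_nonNA <- tapply(!is.na(dat$edu), dat$id_study, sum)

counts <- data.frame(
  id_study = names(n_NA),
  n_nonNA  = as.vector(n_nonNA),
  n_NA     = as.vector(n_NA)
)

# Studies that provided no edu information at all, with their NA counts
no_edu <- counts[counts$n_nonNA == 0, ]
print(no_edu)
```

On the real data, nrow(no_edu) should then match the number of surveys without any edu entries, if the 530 / 728 split described in the question holds.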
