How can I drop observations within a group following the occurrence of NA?

I am trying to clean my data. One of the criteria is that I need an uninterrupted sequence of a variable "assets", but I have some NAs. However, I cannot simply delete the NA observations; I need to delete the NA observation and all subsequent observations within the same group.

Here is an example:

productreference<-c(1,1,1,1,2,2,2,3,3,3,3,4,4,4,5,5,5,5)
Year<-c(2000,2001,2002,2003,1999,2000,2001,2005,2006,2007,2008,1998,1999,2000,2000,2001,2002,2003)
assets<-c(2,3,NA,2,34,NA,45,1,23,34,56,56,67,23,23,NA,14,NA)
mydf<-data.frame(productreference,Year,assets)
mydf

#    productreference Year assets
# 1                 1 2000      2
# 2                 1 2001      3
# 3                 1 2002     NA
# 4                 1 2003      2
# 5                 2 1999     34
# 6                 2 2000     NA
# 7                 2 2001     45
# 8                 3 2005      1
# 9                 3 2006     23
# 10                3 2007     34
# 11                3 2008     56
# 12                4 1998     56
# 13                4 1999     67
# 14                4 2000     23
# 15                5 2000     23
# 16                5 2001     NA
# 17                5 2002     14
# 18                5 2003     NA

I have already seen that there is a way to apply functions by group using plyr, and I have also been able to create a 0-1 column, where 0 indicates that assets has a valid entry and 1 flags a missing value (NA).

# 1 where assets is missing (NA), 0 otherwise
mydf$missing <- as.integer(is.na(mydf$assets))

I have a very large data set, so I cannot delete the rows manually, and I would greatly appreciate your help!

I believe this is what you want:

library(dplyr)
group_by(mydf, productreference) %>%
    filter(cumsum(is.na(assets)) == 0)
# Source: local data frame [11 x 3]
# Groups: productreference [5]
# 
#    productreference  Year assets
#               (dbl) (dbl)  (dbl)
# 1                 1  2000      2
# 2                 1  2001      3
# 3                 2  1999     34
# 4                 3  2005      1
# 5                 3  2006     23
# 6                 3  2007     34
# 7                 3  2008     56
# 8                 4  1998     56
# 9                 4  1999     67
# 10                4  2000     23
# 11                5  2000     23
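
To see why the condition cumsum(is.na(assets)) == 0 keeps only the rows before the first NA, it can help to trace it on a single group. The following is a minimal sketch in base R using the values of product 5 from the example; the variable names (assets_p5, na_flag, na_count, keep) are made up for illustration.

```r
# Values of one group in time order (product 5 in the example data)
assets_p5 <- c(23, NA, 14, NA)

na_flag  <- is.na(assets_p5)   # TRUE wherever the value is missing
na_count <- cumsum(na_flag)    # running count of NAs seen so far
keep     <- na_count == 0      # TRUE only before the first NA

assets_p5[keep]
```

The running count becomes 1 at the first NA and never drops back to 0, so every row from the first NA onward is excluded, including later non-missing values such as 14.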

Here is the same approach using data.table:

library(data.table)
dt <- as.data.table(mydf)

dt[,nas:= cumsum(is.na(assets)),by="productreference"][nas==0]

#    productreference Year assets nas
# 1:                1 2000      2   0
# 2:                1 2001      3   0
# 3:                2 1999     34   0
# 4:                3 2005      1   0
# 5:                3 2006     23   0
# 6:                3 2007     34   0
# 7:                3 2008     56   0
# 8:                4 1998     56   0
# 9:                4 1999     67   0
#10:                4 2000     23   0
#11:                5 2000     23   0

Here is a base R option:

mydf[unsplit(lapply(split(mydf, mydf$productreference),
     function(x) cumsum(is.na(x$assets))==0), mydf$productreference),]    
#   productreference Year assets
#1                 1 2000      2
#2                 1 2001      3
#5                 2 1999     34
#8                 3 2005      1
#9                 3 2006     23
#10                3 2007     34
#11                3 2008     56
#12                4 1998     56
#13                4 1999     67
#14                4 2000     23
#15                5 2000     23
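
A possibly shorter base R variant of the same idea uses ave() to compute the running NA count per group instead of split/lapply/unsplit. This is only a sketch under the assumption that the rows are already ordered by Year within each product; na_so_far and result are illustrative names.

```r
# Example data from the question
mydf <- data.frame(
  productreference = c(1,1,1,1,2,2,2,3,3,3,3,4,4,4,5,5,5,5),
  Year   = c(2000,2001,2002,2003,1999,2000,2001,2005,2006,2007,2008,
             1998,1999,2000,2000,2001,2002,2003),
  assets = c(2,3,NA,2,34,NA,45,1,23,34,56,56,67,23,23,NA,14,NA)
)

# Running count of NAs within each product, aligned with the rows of mydf
na_so_far <- ave(as.integer(is.na(mydf$assets)),
                 mydf$productreference, FUN = cumsum)

# Keep only rows that come before the first NA of their group
result <- mydf[na_so_far == 0, ]
result
```

Because ave() returns a vector in the original row order, it can be used directly as a row filter without reassembling the groups.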

Or another option with data.table:

library(data.table)
# seq_len() selects no rows when the very first value in a group is NA
setDT(mydf)[, if (anyNA(assets)) .SD[seq_len(which(is.na(assets))[1] - 1)]
              else .SD, by = productreference]
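
One edge case worth checking in any position-based approach is a group whose first value is already NA, because seq(0) counts down from 1 to 0 rather than returning an empty index. The sketch below illustrates this in base R; rows_before_first_na is a hypothetical helper written for this illustration, not part of any package.

```r
# seq(0) is 1:0, i.e. the two indices 1 and 0; seq_len(0) is empty
seq(0)
seq_len(0)

# Hypothetical helper: indices of the elements before the first NA
rows_before_first_na <- function(x) {
  first_na <- which(is.na(x))[1]   # NA if x contains no missing value
  n_keep <- if (is.na(first_na)) length(x) else first_na - 1
  seq_len(n_keep)
}

rows_before_first_na(c(NA, 5, 6))     # nothing to keep
rows_before_first_na(c(2, 3, NA, 2))  # first two elements
rows_before_first_na(c(56, 67, 23))   # all elements
```

Indexing with c(1, 0) would return the first row (the NA row itself), which is why seq_len() is the safer choice here.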

You can also do it using base R and a for loop. This code is a bit longer than some of the code in the other answers. In the loop we subset mydf by productreference, and for every subset we look for the first NA in assets and exclude that row and all following rows.

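mydf2 <- NULL
for (i in unique(mydf$productreference)) {
  # Rows belonging to the current product
  s1 <- mydf[mydf$productreference == i, ]
  # Number of rows before the first NA (all rows if there is no NA)
  n_keep <- if (anyNA(s1$assets)) which(is.na(s1$assets))[1] - 1 else nrow(s1)
  # seq_len() also handles the case where the first value is NA
  mydf2 <- rbind(mydf2, s1[seq_len(n_keep), ])
}
mydf2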
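
All of the approaches in this thread depend on the row order within each product, since "following" is taken to mean "later in the data". If the rows are not guaranteed to be sorted, a cautious variant is to order by productreference and Year first. The sketch below assumes that Year defines the sequence; shuffled, sorted, keep and result are illustrative names.

```r
# Example data from the question
mydf <- data.frame(
  productreference = c(1,1,1,1,2,2,2,3,3,3,3,4,4,4,5,5,5,5),
  Year   = c(2000,2001,2002,2003,1999,2000,2001,2005,2006,2007,2008,
             1998,1999,2000,2000,2001,2002,2003),
  assets = c(2,3,NA,2,34,NA,45,1,23,34,56,56,67,23,23,NA,14,NA)
)

# Simulate unsorted input by reversing the rows
shuffled <- mydf[rev(seq_len(nrow(mydf))), ]

# Restore chronological order within each product before filtering
sorted <- shuffled[order(shuffled$productreference, shuffled$Year), ]

# Per-group running NA count; keep rows where no NA has occurred yet
keep <- ave(as.integer(is.na(sorted$assets)),
            sorted$productreference, FUN = cumsum) == 0
result <- sorted[keep, ]
result
```

Without the sorting step, the filter would be applied in whatever order the rows happen to be stored, which may drop or keep the wrong observations.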
