How to drop observations within a group after an NA occurs?

Question

I am trying to clean my data. One of the criteria is that I need an uninterrupted sequence of the variable "assets", but I have some NAs. However, I cannot simply delete the NA observations; I need to drop each NA observation and every observation that follows it within the same group.

Here is an example:

productreference<-c(1,1,1,1,2,2,2,3,3,3,3,4,4,4,5,5,5,5)
Year<-c(2000,2001,2002,2003,1999,2000,2001,2005,2006,2007,2008,1998,1999,2000,2000,2001,2002,2003)
assets<-c(2,3,NA,2,34,NA,45,1,23,34,56,56,67,23,23,NA,14,NA)
mydf<-data.frame(productreference,Year,assets)
mydf

#    productreference Year assets
# 1                 1 2000      2
# 2                 1 2001      3
# 3                 1 2002     NA
# 4                 1 2003      2
# 5                 2 1999     34
# 6                 2 2000     NA
# 7                 2 2001     45
# 8                 3 2005      1
# 9                 3 2006     23
# 10                3 2007     34
# 11                3 2008     56
# 12                4 1998     56
# 13                4 1999     67
# 14                4 2000     23
# 15                5 2000     23
# 16                5 2001     NA
# 17                5 2002     14
# 18                5 2003     NA

I have seen that plyr can apply functions by group, and I have been able to create a 0-1 column in which 0 indicates that assets has a valid entry and 1 marks a missing value (NA).

mydf$missing<-ifelse(mydf$assets>=0,0,1)
mydf[c("missing")][is.na(mydf[c("missing")])] <- 1

I have a very large data set, so I cannot delete the rows manually. I would greatly appreciate your help!

Answer 1

I believe this is what you want:

library(dplyr)
group_by(mydf, productreference) %>%
    filter(cumsum(is.na(assets)) == 0)
# Source: local data frame [11 x 3]
# Groups: productreference [5]
# 
#    productreference  Year assets
#               (dbl) (dbl)  (dbl)
# 1                 1  2000      2
# 2                 1  2001      3
# 3                 2  1999     34
# 4                 3  2005      1
# 5                 3  2006     23
# 6                 3  2007     34
# 7                 3  2008     56
# 8                 4  1998     56
# 9                 4  1999     67
# 10                4  2000     23
# 11                5  2000     23
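
To see why this filter works, it may help to look at a single group in isolation. The sketch below uses a made-up vector x (the assets of product 1 from the question) and plain base R; cumsum(is.na(x)) counts how many NAs have been seen so far, so it stays at 0 only for the rows before the first NA.

```r
# assets of product 1 from the question's example
x <- c(2, 3, NA, 2)

na_flag  <- is.na(x)          # FALSE FALSE TRUE FALSE
na_count <- cumsum(na_flag)   # 0 0 1 1: running number of NAs seen so far
keep     <- na_count == 0     # TRUE TRUE FALSE FALSE

x[keep]                       # 2 3
```

Because the count never goes back down, the NA row itself and everything after it are dropped, which is the behaviour asked for in the question.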

Answer 2

Here is the same approach using data.table:

library(data.table)
dt <- as.data.table(mydf)

dt[,nas:= cumsum(is.na(assets)),by="productreference"][nas==0]

#    productreference Year assets nas
# 1:                1 2000      2   0
# 2:                1 2001      3   0
# 3:                2 1999     34   0
# 4:                3 2005      1   0
# 5:                3 2006     23   0
# 6:                3 2007     34   0
# 7:                3 2008     56   0
# 8:                4 1998     56   0
# 9:                4 1999     67   0
#10:                4 2000     23   0
#11:                5 2000     23   0

Answer 3

Here is a base R option:

mydf[unsplit(lapply(split(mydf, mydf$productreference),
     function(x) cumsum(is.na(x$assets))==0), mydf$productreference),]    
#   productreference Year assets
#1                 1 2000      2
#2                 1 2001      3
#5                 2 1999     34
#8                 3 2005      1
#9                 3 2006     23
#10                3 2007     34
#11                3 2008     56
#12                4 1998     56
#13                4 1999     67
#14                4 2000     23
#15                5 2000     23
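
A somewhat shorter base R variant of the same idea can be sketched with ave(), which applies a function within groups and returns a vector aligned with the original rows. The sort step is an assumption on my part (that "subsequent" means a later Year within the same product), and na_so_far and result are illustrative names, not part of the original answer.

```r
# Rebuild the example data from the question
mydf <- data.frame(
  productreference = c(1,1,1,1,2,2,2,3,3,3,3,4,4,4,5,5,5,5),
  Year   = c(2000,2001,2002,2003,1999,2000,2001,2005,2006,2007,2008,
             1998,1999,2000,2000,2001,2002,2003),
  assets = c(2,3,NA,2,34,NA,45,1,23,34,56,56,67,23,23,NA,14,NA)
)

# Assumption: rows must be in time order within each product
mydf <- mydf[order(mydf$productreference, mydf$Year), ]

# Running count of NAs per product; 0 means no NA has occurred yet
na_so_far <- ave(as.integer(is.na(mydf$assets)),
                 mydf$productreference, FUN = cumsum)

# Keep only the rows that come before the first NA of their product
result <- mydf[na_so_far == 0, ]
result
```

On the example data this keeps the same 11 rows as the split/unsplit version.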

Or an option with data.table:

library(data.table)
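# seq_len() yields an empty index when the first value of a group is NA,
# so such a group is dropped entirely instead of keeping its NA row
setDT(mydf)[, if(any(is.na(assets))) .SD[seq_len(which(is.na(assets))[1]-1)] 
                    else .SD, by = productreference]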
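
One edge case is worth checking with any index-based approach like this: a group whose very first value is NA. None of the groups in the example data starts with NA, so the sketch below uses a made-up vector x. It shows that seq(0) counts down to c(1, 0), while seq_len(0) is empty, which makes seq_len() the safer way to build the row index.

```r
# Hypothetical group whose first value is already NA
x <- c(NA, 5, 7)
first_na <- which(is.na(x))[1]          # 1

idx_seq     <- seq(first_na - 1)        # seq(0) gives c(1, 0)
idx_seq_len <- seq_len(first_na - 1)    # seq_len(0) gives integer(0)

kept_seq     <- x[idx_seq]              # NA: the NA row is kept by mistake
kept_seq_len <- x[idx_seq_len]          # numeric(0): nothing is kept
```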

Answer 4

You can also do it in base R with a for loop. The code is a bit longer than in the other answers. In the loop we subset mydf by productreference, and for every subset we find the first NA in assets and exclude that row and all following rows.

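mydf2 <- NULL
# loop over the product ids that actually occur (they need not be 1..n)
for (i in unique(mydf$productreference)){
  # rows belonging to the current product
  s1 <- mydf[mydf$productreference==i,]
  # number of rows before the first NA (all rows if there is no NA)
  n_keep <- if (anyNA(s1$assets)) which(is.na(s1$assets))[1]-1 else NROW(s1)
  # seq_len(0) is empty, so a product that starts with NA contributes no rows
  mydf2 <- rbind(mydf2, s1[seq_len(n_keep),])
}
mydf2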

4 answers

Answer 1
5 2016-06-15 21:52:46

Answer 2
3 2016-06-15 22:11:18

Answer 3
2 Accepted 2016-06-16 02:32:50

Answer 4
1 2016-06-16 03:08:20