
Delete certain rows in a group of rows in R

Suppose I have this dataset

Id Name Price sales Profit Month Category Mode Supplier
1    A     2     0      0     1        X    K     John
1    A     2     0      0     2        X    K     John
1    A     2     5      8     3        X    K     John
1    A     2     5      8     4        X    L      Sam
2    B     2     3      4     1        X    L      Sam
2    B     2     0      0     2        X    L      Sam
2    B     2     0      0     3        X    M     John
2    B     2     0      0     4        X    L     John
3    C     2     0      0     1        X    K     John
3    C     2     8     10     2        Y    M     John
3    C     2     8     10     3        Y    K     John
3    C     2     0      0     4        Y    K     John
5    E     2     0      0     1        Y    M      Sam
5    E     2     5      5     2        Y    L      Sam
5    E     2     5      9     3        Y    M      Sam
5    E     2     0      0     4        Z    M     Kyle
5    E     2     5      8     5        Z    L     Kyle
5    E     2     5      8     6        Z    M     Kyle

I want to delete rows with zeros in the sales and Profit columns, by Id group: for a given Id, if two or more consecutive rows have zero values for both sales and Profit, those rows should be deleted. The dataset would then look like this:

Id Name Price sales Profit Month Category Mode Supplier
1    A     2     5      8     3        X    K     John
1    A     2     5      8     4        X    L      Sam
2    B     2     3      4     1        X    L      Sam
3    C     2     0      0     1        X    K     John
3    C     2     8     10     2        Y    M     John
3    C     2     8     10     3        Y    K     John
3    C     2     0      0     4        Y    K     John
5    E     2     0      0     1        Y    M      Sam
5    E     2     5      5     2        Y    L      Sam
5    E     2     5      9     3        Y    M      Sam
5    E     2     0      0     4        Z    M     Kyle
5    E     2     5      8     5        Z    L     Kyle
5    E     2     5      8     6        Z    M     Kyle

I can remove all rows if they have zero values for Sales and Profit with

df1 <- df[!(df$sales == 0 & df$Profit == 0), ]

But how do I delete rows only within a certain group, in this case by Id?

PS: The idea is to delete entries for products that only started selling a few months into the yearly cycle, or that were abandoned after a few months.

Here's an approach using rleid from "data.table":

library(data.table)
as.data.table(mydf)[, N := .N, by = .(Id, rleid(sales == 0 & Profit == 0))][
    !(sales == 0 & Profit == 0 & N >= 2)]
##     Id Name Price sales Profit Month Category Mode Supplier N
##  1:  1    A     2     5      8     3        X    K     John 2
##  2:  1    A     2     5      8     4        X    L      Sam 2
##  3:  2    B     2     3      4     1        X    L      Sam 1
##  4:  3    C     2     0      0     1        X    K     John 1
##  5:  3    C     2     8     10     2        Y    M     John 2
##  6:  3    C     2     8     10     3        Y    K     John 2
##  7:  3    C     2     0      0     4        Y    K     John 1
##  8:  5    E     2     0      0     1        Y    M      Sam 1
##  9:  5    E     2     5      5     2        Y    L      Sam 2
## 10:  5    E     2     5      9     3        Y    M      Sam 2
## 11:  5    E     2     0      0     4        Z    M     Kyle 1
## 12:  5    E     2     5      8     5        Z    L     Kyle 2
## 13:  5    E     2     5      8     6        Z    M     Kyle 2
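
To illustrate the run-length idea behind this answer, here is a minimal base-R sketch on a hypothetical toy vector. It uses rle() as a stand-in for numbering runs of equal values, which is roughly the role rleid plays above; the names zero, run_id, run_n and keep are made up for the example, and the real answer additionally groups by Id so that runs never span two products.

```r
# Toy flag vector (assumption): TRUE marks a row where sales and Profit are both zero.
zero <- c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)

# Number each run of equal values and record how long that run is.
runs   <- rle(zero)
run_id <- rep(seq_along(runs$lengths), runs$lengths)  # 1 1 2 2 3 4
run_n  <- rep(runs$lengths, runs$lengths)             # 2 2 2 2 1 1

# Drop a row only when it is a zero row inside a run of length 2 or more.
keep <- !(zero & run_n >= 2)
keep  # FALSE FALSE TRUE TRUE TRUE TRUE
```

The isolated zero row (fifth element) survives because its run has length 1, which matches the behaviour asked for in the question.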

Here's how to do it with dplyr. Basically, I keep only the rows that are not zero, or whose previous and following rows are both not zero.

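library(dplyr)

table1 %>%
  group_by(Id) %>%
  # sales value of the previous and next row within each Id
  mutate(Lag = lag(sales), Lead = lead(sales)) %>%
  rowwise() %>%
  # smallest neighbouring value; NA at the group edges is ignored
  mutate(Min = min(Lag, Lead, na.rm = TRUE)) %>%
  # keep non-zero rows, and zero rows with no zero neighbour
  filter(sales > 0 | Min > 0) %>%
  select(-Lead, -Lag, -Min)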

      Id  Name Price sales Profit Month Category  Mode Supplier
   (int) (chr) (int) (int)  (int) (int)    (chr) (chr)    (chr)
1      1     A     2     5      8     3        X     K     John
2      1     A     2     5      8     4        X     L      Sam
3      2     B     2     3      4     1        X     L      Sam
4      3     C     2     0      0     1        X     K     John
5      3     C     2     8     10     2        Y     M     John
6      3     C     2     8     10     3        Y     K     John
7      3     C     2     0      0     4        Y     K     John
8      5     E     2     0      0     1        Y     M      Sam
9      5     E     2     5      5     2        Y     L      Sam
10     5     E     2     5      9     3        Y     M      Sam
11     5     E     2     0      0     4        Z     M     Kyle
12     5     E     2     5      8     5        Z     L     Kyle
13     5     E     2     5      8     6        Z     M     Kyle

Data

table1 <-read.table(text="
Id,Name,Price,sales,Profit,Month,Category,Mode,Supplier
1,A,2,0,0,1,X,K,John
1,A,2,0,0,2,X,K,John
1,A,2,5,8,3,X,K,John
1,A,2,5,8,4,X,L,Sam
2,B,2,3,4,1,X,L,Sam
2,B,2,0,0,2,X,L,Sam
2,B,2,0,0,3,X,M,John
2,B,2,0,0,4,X,L,John
3,C,2,0,0,1,X,K,John
3,C,2,8,10,2,Y,M,John
3,C,2,8,10,3,Y,K,John
3,C,2,0,0,4,Y,K,John
5,E,2,0,0,1,Y,M,Sam
5,E,2,5,5,2,Y,L,Sam
5,E,2,5,9,3,Y,M,Sam
5,E,2,0,0,4,Z,M,Kyle
5,E,2,5,8,5,Z,L,Kyle
5,E,2,5,8,6,Z,M,Kyle
",sep=",",stringsAsFactors =FALSE, header=TRUE)

UPDATE: To apply these criteria to more than one column, here's how to do it. In the present case the result is the same, because whenever sales is 0, Profit is also 0.

library(dplyr)
table1 %>%
group_by(Id) %>%
mutate(LagS=lag(sales),LeadS=lead(sales),LagP=lag(Profit),LeadP=lead(Profit)) %>%
rowwise() %>%
mutate(MinS=min(LagS,LeadS,na.rm=TRUE),MinP=min(LagP,LeadP,na.rm=TRUE)) %>%
filter(sales>0|MinS>0|Profit>0|MinP>0)  %>%         # "|" means OR
select(-LeadS,-LagS,-MinS,-LeadP,-LagP,-MinP)
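
One caveat, offered as a hedged observation rather than a correction: the filter above tests the two columns separately, so a row that is zero in both columns could be dropped when each neighbour is zero in only one column. A possible alternative is to collapse the columns into a single "both zero" flag first and compare neighbours on that flag. The sketch below does this in base R on a hypothetical toy data frame; toy, zero, prev_zero, next_zero and kept are names invented for the example.

```r
# Hypothetical toy data: the middle row is zero in both columns,
# while each neighbour is zero in only one column.
toy <- data.frame(Id     = c(1, 1, 1),
                  sales  = c(0, 0, 5),
                  Profit = c(5, 0, 0))

# One flag per row: TRUE only when both columns are zero.
zero <- toy$sales == 0 & toy$Profit == 0

# Within each Id, look up the flag of the previous and the next row
# (FALSE at the group edges, where there is no neighbour).
prev_zero <- ave(zero, toy$Id, FUN = function(z) c(FALSE, head(z, -1)))
next_zero <- ave(zero, toy$Id, FUN = function(z) c(tail(z, -1), FALSE))

# Keep non-zero rows, and zero rows that have no zero neighbour.
kept <- toy[!(zero & (prev_zero | next_zero)), ]
kept
```

Here all three rows are kept, since the middle row is an isolated zero row. On the question's data both columns are always zero together, so the two formulations should agree there.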

I can't do it in one line, but here it is in three:

x <- df$sales==0 & df$Profit==0
y <- cumsum(c(1,head(x,-1)!=tail(x,-1)))
df[ave(x,df$Id,y,FUN=sum)<2,]

#    Id Name Price sales Profit Month Category Mode Supplier
# 3   1    A     2     5      8     3        X    K     John
# 4   1    A     2     5      8     4        X    L      Sam
# 5   2    B     2     3      4     1        X    L      Sam
# 9   3    C     2     0      0     1        X    K     John
# 10  3    C     2     8     10     2        Y    M     John
# 11  3    C     2     8     10     3        Y    K     John
# 12  3    C     2     0      0     4        Y    K     John
# 13  5    E     2     0      0     1        Y    M      Sam
# 14  5    E     2     5      5     2        Y    L      Sam
# 15  5    E     2     5      9     3        Y    M      Sam
# 16  5    E     2     0      0     4        Z    M     Kyle
# 17  5    E     2     5      8     5        Z    L     Kyle
# 18  5    E     2     5      8     6        Z    M     Kyle

This works by first identifying all rows where sales and Profit are both 0 (x). The variable y labels runs of consecutive TRUE or FALSE values, increasing by one each time x changes. The ave() function splits its first argument (x) according to the subsequent arguments (df$Id and y), then applies the function within each group. Since the function is sum(), it counts the TRUE values in each group and returns a vector of the same length as x, so we just need to keep the rows where the result is less than 2.
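
As a small worked example of those three steps, here is the same code applied to a hypothetical five-row data frame (toy, n and kept are names made up for illustration), with the intermediate values shown in comments.

```r
# Hypothetical toy data: Id 1 has a zero run of length 2, Id 2 a zero run of length 1.
toy <- data.frame(Id     = c(1, 1, 1, 2, 2),
                  sales  = c(0, 0, 5, 0, 4),
                  Profit = c(0, 0, 8, 0, 3))

x <- toy$sales == 0 & toy$Profit == 0          # TRUE TRUE FALSE TRUE FALSE
y <- cumsum(c(1, head(x, -1) != tail(x, -1)))  # 1 1 2 3 4
n <- ave(x, toy$Id, y, FUN = sum)              # 2 2 0 1 0

# Rows in a zero run of length >= 2 have n >= 2 and are dropped.
kept <- toy[n < 2, ]
kept  # rows 3, 4 and 5 remain
```

The single zero row of Id 2 stays because its run only counts one TRUE, while both zero rows of Id 1 are removed.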

Here is my solution:

aux <- lapply(tapply(df$sales + df$Profit, df$Id, rle), function(x) 
       with(x, cbind(rep(values, lengths), rep(lengths, lengths))))

df[!(do.call(rbind, aux)[,1]==0 & do.call(rbind, aux)[,2] >= 2),]

   Id Name Price sales Profit Month Category Mode Supplier
3   1    A     2     5      8     3        X    K     John
4   1    A     2     5      8     4        X    L      Sam
5   2    B     2     3      4     1        X    L      Sam
9   3    C     2     0      0     1        X    K     John
10  3    C     2     8     10     2        Y    M     John
11  3    C     2     8     10     3        Y    K     John
12  3    C     2     0      0     4        Y    K     John
13  5    E     2     0      0     1        Y    M      Sam
14  5    E     2     5      5     2        Y    L      Sam
15  5    E     2     5      9     3        Y    M      Sam
16  5    E     2     0      0     4        Z    M     Kyle
17  5    E     2     5      8     5        Z    L     Kyle
18  5    E     2     5      8     6        Z    M     Kyle
