Suppose I have this dataset:
Id Name Price sales Profit Month Category Mode Supplier
1 A 2 0 0 1 X K John
1 A 2 0 0 2 X K John
1 A 2 5 8 3 X K John
1 A 2 5 8 4 X L Sam
2 B 2 3 4 1 X L Sam
2 B 2 0 0 2 X L Sam
2 B 2 0 0 3 X M John
2 B 2 0 0 4 X L John
3 C 2 0 0 1 X K John
3 C 2 8 10 2 Y M John
3 C 2 8 10 3 Y K John
3 C 2 0 0 4 Y K John
5 E 2 0 0 1 Y M Sam
5 E 2 5 5 2 Y L Sam
5 E 2 5 9 3 Y M Sam
5 E 2 0 0 4 Z M Kyle
5 E 2 5 8 5 Z L Kyle
5 E 2 5 8 6 Z M Kyle
I want to delete rows with zeroes in the sales and Profit columns, grouped by Id. So for a given Id, if two or more consecutive rows have zero values for both sales and Profit, those rows get deleted. The dataset will then look like this:
Id Name Price sales Profit Month Category Mode Supplier
1 A 2 5 8 3 X K John
1 A 2 5 8 4 X L Sam
2 B 2 3 4 1 X L Sam
3 C 2 0 0 1 X K John
3 C 2 8 10 2 Y M John
3 C 2 8 10 3 Y K John
3 C 2 0 0 4 Y K John
5 E 2 0 0 1 Y M Sam
5 E 2 5 5 2 Y L Sam
5 E 2 5 9 3 Y M Sam
5 E 2 0 0 4 Z M Kyle
5 E 2 5 8 5 Z L Kyle
5 E 2 5 8 6 Z M Kyle
I can remove all rows that have zero values for sales and Profit with
df1 = df[!(df$sales==0 & df$Profit==0),]
But how do I delete rows only within a certain group, in this case by Id?
PS: The idea is to delete entries for products that only started selling after a few months, or that were abandoned after a few months, within a yearly cycle.
Here's an approach using rleid from "data.table":
library(data.table)
as.data.table(mydf)[, N := .N, by = .(Id, rleid(sales == 0 & Profit == 0))][
!(sales == 0 & Profit == 0 & N >= 2)]
## Id Name Price sales Profit Month Category Mode Supplier N
## 1: 1 A 2 5 8 3 X K John 2
## 2: 1 A 2 5 8 4 X L Sam 2
## 3: 2 B 2 3 4 1 X L Sam 1
## 4: 3 C 2 0 0 1 X K John 1
## 5: 3 C 2 8 10 2 Y M John 2
## 6: 3 C 2 8 10 3 Y K John 2
## 7: 3 C 2 0 0 4 Y K John 1
## 8: 5 E 2 0 0 1 Y M Sam 1
## 9: 5 E 2 5 5 2 Y L Sam 2
## 10: 5 E 2 5 9 3 Y M Sam 2
## 11: 5 E 2 0 0 4 Z M Kyle 1
## 12: 5 E 2 5 8 5 Z L Kyle 2
## 13: 5 E 2 5 8 6 Z M Kyle 2
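To make the grouping step concrete, here is a minimal base R sketch of what the run ids and the N column look like for Id 2 from the question. It uses rle() as a stand-in for rleid() so that it runs without data.table; the variable names zero, run_id and keep are only illustrative and do not appear in the answer.

```r
# Zero-row flag for Id 2 in the question: Month 1 has sales, Months 2-4 do not.
zero <- c(FALSE, TRUE, TRUE, TRUE)

# Base R stand-in for rleid(): number each run of identical consecutive values.
runs <- rle(zero)
run_id <- rep(seq_along(runs$lengths), runs$lengths)
run_id
# 1 2 2 2

# N in the answer is the number of rows in each run (.N per run id).
N <- rep(runs$lengths, runs$lengths)
N
# 1 3 3 3

# Keep every row except zero rows that sit in a run of two or more.
keep <- !(zero & N >= 2)
keep
# TRUE FALSE FALSE FALSE
```

Only the first row of Id 2 survives, which matches the expected output in the question.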
Here's how to do it with dplyr. Basically, I keep only the rows where sales is non-zero, or where every neighbouring row (previous and next within the same Id) has non-zero sales.
library(dplyr)
table1 %>%
  group_by(Id) %>%
  mutate(Lag=lag(sales),Lead=lead(sales)) %>%
  rowwise() %>%
  mutate(Min=min(Lag,Lead,na.rm=TRUE)) %>%
  filter(sales>0|Min>0) %>%
  select(-Lead,-Lag,-Min)
Id Name Price sales Profit Month Category Mode Supplier
(int) (chr) (int) (int) (int) (int) (chr) (chr) (chr)
1 1 A 2 5 8 3 X K John
2 1 A 2 5 8 4 X L Sam
3 2 B 2 3 4 1 X L Sam
4 3 C 2 0 0 1 X K John
5 3 C 2 8 10 2 Y M John
6 3 C 2 8 10 3 Y K John
7 3 C 2 0 0 4 Y K John
8 5 E 2 0 0 1 Y M Sam
9 5 E 2 5 5 2 Y L Sam
10 5 E 2 5 9 3 Y M Sam
11 5 E 2 0 0 4 Z M Kyle
12 5 E 2 5 8 5 Z L Kyle
13 5 E 2 5 8 6 Z M Kyle
Data
table1 <-read.table(text="
Id,Name,Price,sales,Profit,Month,Category,Mode,Supplier
1,A,2,0,0,1,X,K,John
1,A,2,0,0,2,X,K,John
1,A,2,5,8,3,X,K,John
1,A,2,5,8,4,X,L,Sam
2,B,2,3,4,1,X,L,Sam
2,B,2,0,0,2,X,L,Sam
2,B,2,0,0,3,X,M,John
2,B,2,0,0,4,X,L,John
3,C,2,0,0,1,X,K,John
3,C,2,8,10,2,Y,M,John
3,C,2,8,10,3,Y,K,John
3,C,2,0,0,4,Y,K,John
5,E,2,0,0,1,Y,M,Sam
5,E,2,5,5,2,Y,L,Sam
5,E,2,5,9,3,Y,M,Sam
5,E,2,0,0,4,Z,M,Kyle
5,E,2,5,8,5,Z,L,Kyle
5,E,2,5,8,6,Z,M,Kyle
",sep=",",stringsAsFactors =FALSE, header=TRUE)
UPDATE: To apply these criteria to more than one column, the same approach can be extended as follows. In the present case the result is the same, because whenever sales is 0, Profit is also 0.
library(dplyr)
table1 %>%
  group_by(Id) %>%
  mutate(LagS=lag(sales),LeadS=lead(sales),LagP=lag(Profit),LeadP=lead(Profit)) %>%
  rowwise() %>%
  mutate(MinS=min(LagS,LeadS,na.rm=TRUE),MinP=min(LagP,LeadP,na.rm=TRUE)) %>%
  filter(sales>0|MinS>0|Profit>0|MinP>0) %>% # "|" means OR
  select(-LeadS,-LagS,-MinS,-LeadP,-LagP,-MinP)
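When several columns are involved, another option is to collapse them into a single "all zero" flag first and then look at run lengths per Id. The following is only a sketch in base R, not part of the answer above: the helper drop_zero_runs and its arguments cols, id and min_run are hypothetical names, and it assumes the rows are already in the intended order within each Id and that the columns contain no NA values.

```r
# Hypothetical helper (not from the answers): drop rows that belong to a run of
# at least `min_run` consecutive rows in which every column in `cols` is zero,
# evaluated separately for each value of the `id` column.
drop_zero_runs <- function(data, cols, id = "Id", min_run = 2) {
  # TRUE where every column listed in `cols` is zero on that row
  zero <- rowSums(data[cols] != 0) == 0
  # Length of the run each row belongs to, computed within each Id
  run_len <- ave(as.integer(zero), data[[id]], FUN = function(z) {
    r <- rle(z)
    rep(r$lengths, r$lengths)
  })
  data[!(zero & run_len >= min_run), , drop = FALSE]
}

# Ids 1 and 2 from the question, reduced to the relevant columns
df <- data.frame(
  Id     = c(1, 1, 1, 1, 2, 2, 2, 2),
  sales  = c(0, 0, 5, 5, 3, 0, 0, 0),
  Profit = c(0, 0, 8, 8, 4, 0, 0, 0)
)
res <- drop_zero_runs(df, cols = c("sales", "Profit"))
res  # keeps original rows 3, 4 and 5
```

Here a row counts as zero only when all listed columns are zero, so adding a further column just means adding its name to cols.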
I can't do it in one line, but here it is in three:
x <- df$sales==0 & df$Profit==0
y <- cumsum(c(1,head(x,-1)!=tail(x,-1)))
df[ave(x,df$Id,y,FUN=sum)<2,]
# Id Name Price sales Profit Month Category Mode Supplier
# 3 1 A 2 5 8 3 X K John
# 4 1 A 2 5 8 4 X L Sam
# 5 2 B 2 3 4 1 X L Sam
# 9 3 C 2 0 0 1 X K John
# 10 3 C 2 8 10 2 Y M John
# 11 3 C 2 8 10 3 Y K John
# 12 3 C 2 0 0 4 Y K John
# 13 5 E 2 0 0 1 Y M Sam
# 14 5 E 2 5 5 2 Y L Sam
# 15 5 E 2 5 9 3 Y M Sam
# 16 5 E 2 0 0 4 Z M Kyle
# 17 5 E 2 5 8 5 Z L Kyle
# 18 5 E 2 5 8 6 Z M Kyle
This works by first identifying all rows where sales and Profit are both 0 (x). The variable y groups consecutive TRUE and FALSE values. The ave() function splits the first input variable (x) according to the subsequent input variables (df$Id and y) then applies the function within groups. Since the function is sum(), it will add up all the TRUE values in x, then it returns a vector of the same length as x, so we just need to keep all the rows where the result is less than 2.
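The intermediate values are easier to follow on a small example. The sketch below uses a made-up six-row data frame (not the question's data) in which Id 1 ends with a zero row and Id 2 starts with one, to show why df$Id has to be passed to ave() alongside y.

```r
# Made-up data: Id 1 ends with a zero row and Id 2 begins with one.
df <- data.frame(
  Id     = c(1, 1, 1, 1, 2, 2),
  sales  = c(0, 0, 5, 0, 0, 3),
  Profit = c(0, 0, 8, 0, 0, 4)
)

x <- df$sales == 0 & df$Profit == 0
x
# TRUE TRUE FALSE TRUE TRUE FALSE

# y increases by one every time x changes, so each run gets its own number.
y <- cumsum(c(1, head(x, -1) != tail(x, -1)))
y
# 1 1 2 3 3 4

# Rows 4 and 5 share run number 3 but belong to different Ids, so grouping by
# both df$Id and y counts them separately (1 each rather than 2).
zeros_in_run <- ave(x, df$Id, y, FUN = sum)
zeros_in_run
# 2 2 0 1 1 0

kept <- df[zeros_in_run < 2, ]
kept  # rows 3 to 6
```

Only the first two rows form a run of two zero rows within a single Id, so only they are dropped.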
Here is my solution:
aux <- lapply(tapply(df$sales + df$Profit, df$Id, rle), function(x)
with(x, cbind(rep(values, lengths), rep(lengths, lengths))))
df[!(do.call(rbind, aux)[,1]==0 & do.call(rbind, aux)[,2] >= 2),]
Id Name Price sales Profit Month Category Mode Supplier
3 1 A 2 5 8 3 X K John
4 1 A 2 5 8 4 X L Sam
5 2 B 2 3 4 1 X L Sam
9 3 C 2 0 0 1 X K John
10 3 C 2 8 10 2 Y M John
11 3 C 2 8 10 3 Y K John
12 3 C 2 0 0 4 Y K John
13 5 E 2 0 0 1 Y M Sam
14 5 E 2 5 5 2 Y L Sam
15 5 E 2 5 9 3 Y M Sam
16 5 E 2 0 0 4 Z M Kyle
17 5 E 2 5 8 5 Z L Kyle
18 5 E 2 5 8 6 Z M Kyle
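As a rough illustration of what the rle() step produces for a single Id, here is a small sketch using the sales + Profit values of Id 1 from the question. The names s, value, run_len and drop are only illustrative. Two assumptions are worth stating: the sum is zero only when both columns are zero if neither can be negative, and since tapply() returns its groups in sorted Id order, the final row selection presumably lines up with df only when df is already sorted by Id.

```r
# sales + Profit for Id 1 in the question (rows are 0/0, 0/0, 5/8, 5/8).
# Assumption: values are never negative, so the sum is 0 only if both are 0.
s <- c(0, 0, 13, 13)

r <- rle(s)                            # two runs: value 0 twice, value 13 twice
value   <- rep(r$values,  r$lengths)   # run value repeated for every row
run_len <- rep(r$lengths, r$lengths)   # run length repeated for every row

# A row is dropped when it is zero and sits in a run of two or more.
drop <- value == 0 & run_len >= 2
drop
# TRUE TRUE FALSE FALSE
```

These two repeated vectors correspond to the two columns that cbind() builds for each Id in the answer.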