
Compare rows in groups in the same data frame

My data looks like this:

library(dplyr)
library(data.table)

df <- data.frame(
  customernumber = c("111", "111", "111",  "111", "111","222", "222", "222", "222", "222", "222", "222"), 
  ordernumber = c("1", "1", "1", "2", "2", "1", "1", "1", "1", "2", "2", "3"), 
  article = c("JeansA", "JeansA", "ShirtA", "JeansA", "JeansB", "ShirtA", "ShirtB", "ShirtB", "JeansA", "JeansB", "ShirtA", "JeansB"), 
  size = c("40", "42", "40", "42", "44", "36", "36", "40", "40", "38", "44", "36"), 
  returned = c("1", "1", "0", "0", "1", "1", "1", "0", "0", "0", "0", "0")
)

Output:

   customernumber ordernumber article size returned
1             111           1  JeansA   40        1
2             111           1  JeansA   42        1
3             111           1  ShirtA   40        0
4             111           2  JeansA   42        0
5             111           2  JeansB   44        1
6             222           1  ShirtA   36        1
7             222           1  ShirtB   36        1
8             222           1  ShirtB   40        0
9             222           1  JeansA   40        0
10            222           2  JeansB   38        0
11            222           2  ShirtA   44        0
12            222           3  JeansB   36        0

Now I want to mark, per customer, every row where an article that was returned in one order is ordered again in the next order in a different size. These articles are only exchanged and therefore cannot truly be counted as a new order. The end result should look like this:

Result:

   customernumber ordernumber article size returned changed
1             111           1  JeansA   40        1       0
2             111           1  JeansA   42        1       0
3             111           1  ShirtA   40        0       0
4             111           2  JeansA   42        0       1
5             111           2  JeansB   44        1       0
6             222           1  ShirtA   36        1       0
7             222           1  ShirtB   36        1       0
8             222           1  ShirtB   40        0       0
9             222           1  JeansA   40        0       0
10            222           2  JeansB   38        0       0
11            222           2  ShirtA   44        0       1
12            222           3  JeansB   36        0       0

I thought I could solve the problem by introducing a lag variable using dplyr (or data.table), but I only manage to lag the variable within the same group; I fail to lag it into the next group. That is:

df %>% 
  group_by(customernumber, ordernumber, article) %>% 
  mutate(lag_size = lag(size, order_by = article))

or:

df <- data.table(df)
setorder(df, customernumber, ordernumber, article)
df[,lag_size := shift(size), by = .(customernumber, ordernumber, article)]

I would rather avoid a for loop (I am not even sure it would solve the problem), since the data set is quite big and a loop would take ages. Overall I am short of ideas, so any help is appreciated.

Thanks!



AddOn:

I stumbled into another issue related to this case. I only want to mark an article as changed if it was ordered in another size in the next follow-up order, not if the same article was ordered again in the same size. So the criterion for the variable changed would be:

Order n: returned == 1
Order n+1: same article, different size --> changed == 1 (otherwise changed == 0)

Here is the updated example:

df <- data.frame(
 customernumber = c("111", "111", "111",  "111", "111", "111","222", "222", "222", "222", "222", "222", "222"), 
 ordernumber = c("1", "1", "1", "2", "2", "2", "1", "1", "1", "1", "2", "2", "3"), 
 article = c("JeansA", "JeansA", "ShirtA", "JeansA", "JeansA", "JeansB", "ShirtA", "ShirtB", "ShirtB", "JeansA", "JeansB", "ShirtA", "JeansB"), 
 size = c("40", "42", "40", "40", "44", "44", "36", "36", "40", "40", "38", "44", "36"), 
 returned = c("1", "1", "0", "0", "1", "1", "1", "1", "0", "0", "0", "0", "0")
)

Output:

   customernumber ordernumber article size returned
1             111           1  JeansA   40        1
2             111           1  JeansA   42        1
3             111           1  ShirtA   40        0
4             111           2  JeansA   40        0
5             111           2  JeansA   44        1
6             111           2  JeansB   44        1
7             222           1  ShirtA   36        1
8             222           1  ShirtB   36        1
9             222           1  ShirtB   40        0
10            222           1  JeansA   40        0
11            222           2  JeansB   38        0
12            222           2  ShirtA   44        0
13            222           3  JeansB   36        0

Result:

   customernumber ordernumber article size returned changed
1             111           1  JeansA   40        1       0
2             111           1  JeansA   42        1       0
3             111           1  ShirtA   40        0       0
4             111           2  JeansA   40        0       0
5             111           2  JeansA   44        1       1
6             111           2  JeansB   44        1       0   
7             222           1  ShirtA   36        1       0
8             222           1  ShirtB   36        1       0
9             222           1  ShirtB   40        0       0
10            222           1  JeansA   40        0       0
11            222           2  JeansB   38        0       0
12            222           2  ShirtA   44        0       1
13            222           3  JeansB   36        0       0

Sorry for the confusion. I made a mistake in my first example and filled in the changed variable incorrectly. If you are still up for helping me, I would appreciate it very much.

Thank you!

New answer:

A possible solution with data.table:

library(data.table)
setDT(df)

df[, changed := 0
   ][df[df, on = .(customernumber, ordernumber < ordernumber, article), nomatch = 0
        ][size != i.size & returned == 1, .SD[!i.size %in% size], by = .(customernumber, ordernumber, article)
          ][, .(customernumber, ordernumber, article, size = i.size)][, unique(.SD)]
     , on = .(customernumber, ordernumber, article, size), changed := 1][]

which gives:

    customernumber ordernumber article size returned changed
 1:            111           1  JeansA   40        1       0
 2:            111           1  JeansA   42        1       0
 3:            111           1  ShirtA   40        0       0
 4:            111           2  JeansA   40        0       0
 5:            111           2  JeansA   44        1       1
 6:            111           2  JeansB   44        1       0
 7:            222           1  ShirtA   36        1       0
 8:            222           1  ShirtB   36        1       0
 9:            222           1  ShirtB   40        0       0
10:            222           1  JeansA   40        0       0
11:            222           2  JeansB   38        0       0
12:            222           2  ShirtA   44        0       1
13:            222           3  JeansB   36        0       0
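For readers who find the chained join above hard to follow, here is a minimal sketch of the AddOn criterion split into separate update joins. It is not the answer's code: it assumes ordernumber holds consecutive integers stored as text, and it only looks at the immediately preceding order (n to n+1), whereas the non-equi join above matches any earlier order. The names dt, ord and ret are introduced only for this illustration.

```r
library(data.table)

# Updated example data from the question (13 rows), stored as character
dt <- data.table(
  customernumber = c("111", "111", "111", "111", "111", "111",
                     "222", "222", "222", "222", "222", "222", "222"),
  ordernumber = c("1", "1", "1", "2", "2", "2",
                  "1", "1", "1", "1", "2", "2", "3"),
  article = c("JeansA", "JeansA", "ShirtA", "JeansA", "JeansA", "JeansB",
              "ShirtA", "ShirtB", "ShirtB", "JeansA", "JeansB", "ShirtA", "JeansB"),
  size = c("40", "42", "40", "40", "44", "44",
           "36", "36", "40", "40", "38", "44", "36"),
  returned = c("1", "1", "0", "0", "1", "1",
               "1", "1", "0", "0", "0", "0", "0")
)

# Assumption: order numbers are consecutive integers stored as text
dt[, ord := as.integer(ordernumber)]

# Returned rows, re-keyed to the order that follows them (n + 1)
ret <- unique(dt[returned == "1",
                 .(customernumber, ord = ord + 1L, article, size)])

dt[, changed := 0L]
# Step 1: flag rows whose article was returned in the previous order
dt[ret, on = .(customernumber, ord, article), changed := 1L]
# Step 2: unflag rows that re-order exactly a returned size
dt[ret, on = .(customernumber, ord, article, size), changed := 0L]
dt[, ord := NULL]

dt
```

On the updated example this flags rows 5 and 12, the same rows as in the output above. Both steps are plain equi-joins that update by reference, so no for loop is involved.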

Old answer:

library(data.table)
setDT(df)

df[df[returned == 0][df[returned == 1]
                     , on = .(customernumber, article)
                     ][ordernumber != i.ordernumber]
   , on = .(customernumber, article, returned)
   , changed := i.returned
   ][, changed := replace(changed, is.na(changed), 0)][]

which gives:

    customernumber ordernumber article size returned changed
 1:            111           1  JeansA   40        1       0
 2:            111           1  JeansA   42        1       0
 3:            111           1  ShirtA   40        0       0
 4:            111           2  JeansA   42        0       1
 5:            111           2  JeansB   44        1       0
 6:            222           1  ShirtA   36        1       0
 7:            222           1  ShirtB   36        1       0
 8:            222           1  ShirtB   40        0       0
 9:            222           1  JeansA   40        0       0
10:            222           2  JeansB   38        0       0
11:            222           2  ShirtA   44        0       1
12:            222           3  JeansB   36        0       0

You are working with more than one lag condition, so we need more than one lag command to build it. We can then use case_when to create the changed column.

df2 <- df %>%
  group_by(customernumber, article) %>%
  mutate(lag_returned = lag(returned),
         lag_ordernumber = lag(ordernumber)) %>%
  ungroup() %>%
  mutate(changed = case_when(
    returned %in% "0" & 
      duplicated(article) & 
        lag_returned %in% "1" &
          ordernumber != lag_ordernumber ~ "1",
    TRUE                                 ~ "0"
  )) %>%
  select(-starts_with("lag"))

df2
# # A tibble: 12 x 6
#    customernumber ordernumber article size  returned changed
#    <fct>          <fct>       <fct>   <fct> <fct>    <chr>  
#  1 111            1           JeansA  40    1        0      
#  2 111            1           JeansA  42    1        0      
#  3 111            1           ShirtA  40    0        0      
#  4 111            2           JeansA  42    0        1      
#  5 111            2           JeansB  44    1        0      
#  6 222            1           ShirtA  36    1        0      
#  7 222            1           ShirtB  36    1        0      
#  8 222            1           ShirtB  40    0        0      
#  9 222            1           JeansA  40    0        0      
# 10 222            2           JeansB  38    0        0      
# 11 222            2           ShirtA  44    0        1      
# 12 222            3           JeansB  36    0        0 
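The output above is for the original 12-row example. For the AddOn criterion (order n returned, order n+1 same article in a different size), one possible dplyr variant replaces the lag columns with filtering joins. This is a sketch only: it assumes ordernumber holds consecutive integers stored as text, and the helper names row_id, order_int, returned_next and changed_ids are made up for this illustration.

```r
library(dplyr)

# Updated example data from the question (13 rows)
df <- data.frame(
  customernumber = c("111", "111", "111", "111", "111", "111",
                     "222", "222", "222", "222", "222", "222", "222"),
  ordernumber = c("1", "1", "1", "2", "2", "2",
                  "1", "1", "1", "1", "2", "2", "3"),
  article = c("JeansA", "JeansA", "ShirtA", "JeansA", "JeansA", "JeansB",
              "ShirtA", "ShirtB", "ShirtB", "JeansA", "JeansB", "ShirtA", "JeansB"),
  size = c("40", "42", "40", "40", "44", "44",
           "36", "36", "40", "40", "38", "44", "36"),
  returned = c("1", "1", "0", "0", "1", "1",
               "1", "1", "0", "0", "0", "0", "0"),
  stringsAsFactors = FALSE
)

# Assumption: order numbers are consecutive integers stored as text
df <- df %>%
  mutate(row_id = row_number(),
         order_int = as.integer(as.character(ordernumber)))

# Returned rows, re-keyed to the order that follows them (n + 1)
returned_next <- df %>%
  filter(returned == "1") %>%
  mutate(order_int = order_int + 1L) %>%
  select(customernumber, order_int, article, returned_size = size) %>%
  distinct()

# Keep rows whose article was returned in the previous order,
# then drop those that re-order exactly a returned size
changed_ids <- df %>%
  semi_join(returned_next,
            by = c("customernumber", "order_int", "article")) %>%
  anti_join(returned_next,
            by = c("customernumber", "order_int", "article",
                   "size" = "returned_size")) %>%
  pull(row_id)

result <- df %>%
  mutate(changed = as.integer(row_id %in% changed_ids)) %>%
  select(-row_id, -order_int)

result
```

On the updated example, changed is 1 for rows 5 and 12 only, which matches the expected result in the question's AddOn. Comparing against the set of returned sizes, rather than a single lagged value, is what keeps row 4 (JeansA in size 40, re-ordered in the same size) at 0.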

