Delete rows based on a match between two different columns in a data frame

Question

I have a data frame with daily channel revenue from multiple channels. The data frame looks like the following:

orders_dataframe:

    Order |Channel | Revenue |
    1     |TV      | 120     |
    2     |Email   | 30      |
    3     |Retail  | 300     |
    4     |Shop1   | 50      |
    5     |Shop2   | 90      |
    6     |Email   | 20      |
    7     |Retail  | 250     |

What I would like to do is to split those revenues coming from Retail and divide them between Shop1 and Shop2 according to a predefined ratio (eg, 60%/40% split). For example, I would like that all rows with revenue coming from "Retail" get attributed 60% to Shop1 and 40% to Shop2. This can be reflected by replacing all retail-revenue rows with two new rows, as seen for Order 3 and Order 7 in the final table I want to get below:

orders_dataframe:  

    Order |Channel | Revenue |
    1     |TV      | 120     |
    2     |Email   | 30      |
    3     |Shop1   | 180     |
    3     |Shop2   | 120     |
    4     |Shop1   | 50      |
    5     |Shop2   | 90      |
    6     |Email   | 20      |
    7     |Shop1   | 150     |
    7     |Shop2   | 100     |

Ideally, since I am performing this with various datasets, I would like to take the percentages from a data frame (split_dataframe) instead of manually assigning the figures 60% and 40%. I would like to use the figures from a dataset like below:

split_dataframe:
    Channel |Percent  |
    Shop1   |60%      | 
    Shop2   |40%      |

Here is a reproducible example of the two data frames:

orders_dataframe <- data.frame(Order = c(1,2,3,4,5,6,7),
                              Channel = c("TV", "Email", "Retail", "Shop1", "Shop2", "Email", "Retail"), 
                              Revenue = c(120,30,300,50,90,20,250))

split_dataframe <- data.frame(Channel = c("Shop1", "Shop2"),
                              Percent = c(0.6, 0.4))

Thank you very much!

Answer 1

With dplyr ,

split_dataframe  %>% 
mutate(Index="Retail") %>%
merge(.,orders_dataframe,by.x="Index",by.y="Channel") %>%
mutate(Revenue=Revenue*Percent) %>%
select(Order,Channel,Revenue) %>%
bind_rows(orders_dataframe %>% filter(Channel !="Retail"),.)%>%
arrange(.,Order)

gives,

  Order Channel Revenue
1     1      TV     120
2     2   Email      30
3     3   Shop1     180
4     3   Shop2     120
5     4   Shop1      50
6     5   Shop2      90
7     6   Email      20
8     7   Shop1     150
9     7   Shop2     100

Answer 2

Here is a data.table approach... see comments in code for explanation

library( data.table )
#make them data.tables
setDT( orders_dataframe ); setDT( split_dataframe )
#split to retail en non-retail orders
orders_retail    <- orders_dataframe[ Channel == "Retail", ]
orders_no_retail <- orders_dataframe[ !Channel == "Retail", ]
#divide the retail orders over the two shops (multiple steps)
#create a new colum by shop
shop_cols <- split_dataframe$Channel
orders_retail[, (shop_cols) := Revenue ]
#melt to long format
orders_retail.melt <- melt( orders_retail, 
                            id.vars = "Order", 
                            measure.vars = (shop_cols),
                            variable.name = "Channel",
                            value.name = "Revenue")
#and update the molten data with the percentages in the split_dataframe
orders_retail.melt[ split_dataframe, 
                    Revenue := Revenue * i.Percent,
                    on = .( Channel )]
#merge everything back together and order on Order id
ans <- rbind( orders_no_retail, orders_retail.melt )
setorder( ans, Order )
#    Order Channel Revenue
# 1:     1      TV     120
# 2:     2   Email      30
# 3:     3   Shop1     180
# 4:     3   Shop2     120
# 5:     4   Shop1      50
# 6:     5   Shop2      90
# 7:     6   Email      20
# 8:     7   Shop1     150
# 9:     7   Shop2     100

Answer 3

You can do this in base R.

orders_dataframe <- data.frame(Order = c(1,2,3,4,5,6,7),
                               Channel = c("TV", "Email", "Retail", "Shop1", "Shop2", "Email", "Retail"), 
                               Revenue = c(120,30,300,50,90,20,250))

# Coerce the channel factor to a string.
# Do you really want this as a factor?
orders_dataframe$Channel <- as.character(orders_dataframe$Channel)

# Create a vector of the replacement values.
# The prob = c() argument lets you pick the
# probabilities of each replacement.
replacement <- sample(x = c("Store1","Store2"),
                      size = length(which(orders_dataframe$Channel == "Retail")),
                      replace = TRUE, prob = c(0.6, 0.4))

# Replace the Channel columnn with the replacement vector.
orders_dataframe$Channel[which(orders_dataframe$Channel == "Retail")] <- replacement

Delete rows based on a match between two different columns in a data frame

Question

3 answers

solution1
1 ACCPTED 2020-08-12 17:41:33

solution2
0 2020-08-12 17:32:12

solution3
0 2020-08-12 17:47:37

Delete rows based on a match between two different columns in a data frame

Question

3 answers

solution1 1 ACCPTED 2020-08-12 17:41:33

solution2 0 2020-08-12 17:32:12

solution3 0 2020-08-12 17:47:37

solution1
1 ACCPTED 2020-08-12 17:41:33

solution2
0 2020-08-12 17:32:12

solution3
0 2020-08-12 17:47:37