R Sequence by columns and random sequence

Say I have a data frame of customers:

cust_df = 
Date      ArrivalTime    TimeInStore     AmountSpent
170920     930             30               20
170920     1000            20               20
170920     1001            30               100
170920     1500            15               10
170921     1030            10               200
170921     1111            25               50
170921     1900            10               75

I want to do two things:

1. Check how much time and money the first 3 customers on each day spend.
2. Compare that to 3 random customers from each day (they may or may not be among the first three).

If a day had fewer than 3 customers, I want to include all customers from that day.

What is the most efficient way to do so?

Currently my code is:

cust_df <- cust_df[order(cust_df$Date, cust_df$ArrivalTime),] #order by time
cust_df_by_Date <- split(cust_df,f = cust_df$Date) #split to dates
cust_num <- sapply(cust_df_by_Date,function(x) dim(x)[1]) #find num of customers per day
first_cust_df <- c()
i <- 1
for(num in cust_num ){
    if(num>=3){
        first_cust_df <- rbind(first_cust_df,cust_df_by_Date[[i]][1:3,])
    }
    else{
        first_cust_df <- rbind(first_cust_df,cust_df_by_Date[[i]][1:num,])
    }
    i <- i+1
}

And for the random part:

library(plyr) # ldply comes from the plyr package
rand_cust_sampling_df <- ldply(cust_df_by_Date, function(x) x[sample(nrow(x), min(3, nrow(x))), ]) # up to 3 random rows per day

I'm quite sure that there is a more efficient way to do so, but I'm new to this language and couldn't find an answer to this specific question.

Thanks!

The dplyr package can help you here.

install.packages("dplyr")
library(dplyr)

To get the first 3 customers on each day, group_by Date and then slice:

cust_df %>% 
  group_by(Date) %>% 
  slice(1:3)
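
Note that slice(1:3) takes rows in their current order, so "first" only means "earliest arrival" if the data is already sorted, as it is after the ordering step in the question. A minimal sketch that sorts explicitly, assuming dplyr 1.0.0 or later (for slice_head) and the sample data from the question; the name first_three is just illustrative:

```r
library(dplyr)

# Sample data copied from the question
cust_df <- data.frame(
  Date        = c(170920, 170920, 170920, 170920, 170921, 170921, 170921),
  ArrivalTime = c(930, 1000, 1001, 1500, 1030, 1111, 1900),
  TimeInStore = c(30, 20, 30, 15, 10, 25, 10),
  AmountSpent = c(20, 20, 100, 10, 200, 50, 75)
)

first_three <- cust_df %>%
  arrange(Date, ArrivalTime) %>%  # make "first" mean earliest arrival
  group_by(Date) %>%
  slice_head(n = 3) %>%           # a day with fewer than 3 rows keeps all of them
  ungroup()

print(first_three)
```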

It's not clear from your question how you want to summarise time and spending, but you could sum, for example, like this:

cust_df %>% 
  group_by(Date) %>% 
  slice(1:3) %>% 
  summarise(sumSpent = sum(AmountSpent))

    Date sumSpent
   <int>    <int>
1 170920      140
2 170921      325
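
Since the question asks about both time and money, the same pipeline can return several summaries at once. A sketch under the assumption that totals are what is wanted; the column names nCust, sumTime and sumSpent are illustrative, and sum could be swapped for mean:

```r
library(dplyr)

# Sample data copied from the question
cust_df <- data.frame(
  Date        = c(170920, 170920, 170920, 170920, 170921, 170921, 170921),
  ArrivalTime = c(930, 1000, 1001, 1500, 1030, 1111, 1900),
  TimeInStore = c(30, 20, 30, 15, 10, 25, 10),
  AmountSpent = c(20, 20, 100, 10, 200, 50, 75)
)

first_summary <- cust_df %>%
  arrange(Date, ArrivalTime) %>%      # earliest arrivals first within each day
  group_by(Date) %>%
  slice(1:3) %>%                      # first 3 rows per day (fewer if the day is short)
  summarise(
    nCust    = n(),                   # how many customers were actually included
    sumTime  = sum(TimeInStore),      # total time in store
    sumSpent = sum(AmountSpent)       # total amount spent
  )

print(first_summary)
```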

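You can randomly select 3 customers per date using sample_n. Be aware that sample_n(3) throws an error if any date has fewer than 3 customers, which does not happen with this sample data: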

cust_df %>% 
  group_by(Date) %>% 
  sample_n(3) %>% 
  summarise(sumSpent = sum(AmountSpent))

    Date sumSpent
   <int>    <int>
1 170920      130
2 170921      325
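
To keep all customers on days with fewer than 3, one option is slice_sample. This is a sketch assuming dplyr 1.0.0 or later, where slice_sample(n = 3) silently truncates to the group size. The third day (170922, with only two customers) is hypothetical data added to show that case, and set.seed is only there to make the draw repeatable:

```r
library(dplyr)

# Sample data from the question, plus a HYPOTHETICAL day (170922)
# that has only two customers
cust_df <- data.frame(
  Date        = c(170920, 170920, 170920, 170920, 170921, 170921, 170921,
                  170922, 170922),
  ArrivalTime = c(930, 1000, 1001, 1500, 1030, 1111, 1900, 900, 1200),
  TimeInStore = c(30, 20, 30, 15, 10, 25, 10, 5, 40),
  AmountSpent = c(20, 20, 100, 10, 200, 50, 75, 15, 60)
)

set.seed(42)  # arbitrary seed, only for reproducibility

random_three <- cust_df %>%
  group_by(Date) %>%
  slice_sample(n = 3) %>%  # up to 3 random rows per day, without replacement
  ungroup()

# Number of sampled customers per day
print(count(random_three, Date))
```

The summarise step from the earlier examples can be appended to this pipeline unchanged to compare the random sample with the first three customers.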
