简体   繁体   中英

Calculating support of an amount based on a threshold in R data.table

Lets say I have a data.table which looks like this

    customer TranAmount
 1:   146506    1290.49
 2:   146506    2699.00
 3:   146506    2720.00
 4:   146506    2700.00
 5:   146506       6.35
 6:   146506    2700.00
 7:   146506    2705.00
 8:   146506    2691.00
 9:   146506     500.00
10:   146506     500.95
11:   146506      52.00

Now I want to calculate support of each amount, by support I mean if I pick a transaction and there are transactions which is within the threshold of that transaction, then support of that transaction is equal to the number of transaction within that limit.

For example in the above data, for TranAmount 2700.00, if we consider a threshold of 1% (above 2700.00 or below 2700.00) then there are 6 transactions within that range, so support for TranAmount 2700.00 is 6

I have written this function which does that but its slow and certainly not in data.table way but it does the job, I am sure there are better ways to achieve this, but I cant think of any.

get_support <- function(dt,val_tolerance=0.01) {
    support_dt <- dt[,.(customer,TranAmount)][order(TranAmount)]
    support_dt[,support:= 0]

    for(i in 1:nrow(support_dt)) {
      start <- support_dt[i,TranAmount]
      current_support <- support_dt[i,support]
      amount_limit <- c((start - start*val_tolerance),(start + start*val_tolerance))
      for (j in 1:nrow(support_dt)){
        amount <- support_dt[j,TranAmount]
        if(between(amount,amount_limit[1],amount_limit[2]) ==TRUE ){
          current_support <- current_support+1
        }else{
          current_support <- current_support
        }
      }
      #print(current_support)
      support_dt[i,support:=current_support]
    } 
    print(support_dt)
}

Please suggest better way to achieve the same.

With version 1.9.7 or later, use non-equi joins:

vals = c(2700, 500)

DT[.(dn = vals*0.99, up = vals*1.01), on=.(TranAmount >= dn, TranAmount <= up), 
  .N
, by=.EACHI]

#    TranAmount TranAmount N
# 1:       2673       2727 6
# 2:        495        505 2

The column names in the result aren't very intuitive, but those might change.

Currently (Aug 2016), you'll need to install the devel version for this.

I got warnings but apparently the logic was sound:

 dt[ , support := ave(TranAmount, TranAmount,
                           FUN= function(x) sum(abs(x -dt$TranAmount) < 0.01*x) ) ]
 #---------------------------------
Warning messages:
1: In x - dt$TranAmount :
  longer object length is not a multiple of shorter object length
2: In abs(x - dt$TranAmount) < 0.01 * x :
  longer object length is not a multiple of shorter object length
> dt
    customer TranAmount support
 1:   146506    1290.49       1
 2:   146506    2699.00       6
 3:   146506    2720.00       5
 4:   146506    2700.00       6
 5:   146506       6.35       1
 6:   146506    2700.00       6
 7:   146506    2705.00       6
 8:   146506    2691.00       5
 9:   146506     500.00       2
10:   146506     500.95       2
11:   146506      52.00       1

Here is a hacky data.table solution that works (there is likely a cleaner way though)

tvals <- df$TranAmount
pct   <- 0.01

dfDT[, id := .I][,support := sum( TranAmount*(1-pct) <= tvals & tvals <= TranAmount*(1+pct) ), by = list(id)][,id:=NULL]

EDIT : an alternate data.table approach

dfDT[, support := sum( TranAmount*(1-pct) <= tvals & tvals <= TranAmount*(1+pct) ), by = rownames(dfDT)]

and here is a dplyr solution

df %>%
    group_by(rn=row_number()) %>%
    mutate(support = sum( TranAmount*(1-pct) <= tvals & tvals <= TranAmount*(1+pct) ) ) %>%
    ungroup %>%
    select(-rn)

##    customer TranAmount support
##       <int>      <dbl>   <int>
## 1    146506    1290.49       1
## 2    146506    2699.00       6
## 3    146506    2720.00       5
## 4    146506    2700.00       6
## 5    146506       6.35       1
## 6    146506    2700.00       6
## 7    146506    2705.00       6
## 8    146506    2691.00       5
## 9    146506     500.00       2
## 10   146506     500.95       2
## 11   146506      52.00       1

Note, df ( data.frame ) and dfDT ( data.table ) contain the same data.

Here's a relatively simple data.table way

x[, support := x[abs(TranAmount- .SD[,TranAmount]) < 0.01*.SD[,TranAmount] , .N], 
    by=1:NROW(x)]

#    customer  TranAmount support
# 1:   146506 1290.49       1
# 2:   146506 2699.00       6
# 3:   146506 2720.00       5
# 4:   146506 2700.00       6
# 5:   146506    6.35       1
# 6:   146506 2700.00       6
# 7:   146506 2705.00       6
# 8:   146506 2691.00       5
# 9:   146506  500.00       2
#10:   146506  500.95       2
#11:   146506   52.00       1

The data:

x = data.table(customer=rep(146506,11), 
               TranAmount=c(1290.49, 2699.00, 2720.00, 2700.00, 6.35, 2700.00,
                            2705.00, 2691.00, 500.00, 500.95, 52.00))

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM