简体   繁体   中英

R: data.table custom function on subset

Here a minimal working example of my data:

library(data.table)
df <- data.table(date=as.Date(c("1999-01-04","1999-01-04","1999-01-04","1999-01-04","1999-01-04","1999-01-04","1999-01-04","1999-01-04")),volume=c(1000,1000,1200,1250,1200,1300,1250,1200),cp_flag=c("P","C","C","P","C","C","P","P"),price_in=c(10.1,120.4,100.3,0.1,90.2,45.7,99.1,7.4), price_out=c(12.4,122.1,102.0,0.6,99.1,48.1,100.0,8.1), dtm=c(10,10,12,12,12,15,15,12))
setorder(df,date,dtm,volume)
df
         date volume cp_flag price_in price_out dtm
1: 1999-01-04   1000       P     10.1      12.4  10
2: 1999-01-04   1000       C    120.4     122.1  10
3: 1999-01-04   1200       C    100.3     102.0  12
4: 1999-01-04   1200       P      7.4       8.1  12
5: 1999-01-04   1200       C     90.2      99.1  13
6: 1999-01-04   1250       P      0.1       0.6  12
7: 1999-01-04   1250       P     99.1     100.0  15
8: 1999-01-04   1300       C     45.7      48.1  15

My goal is now: for each date I want to compute a customary function for all items with the same 1) volume and 2)dtm (aka duration), depending on whether it is a "C" or a "P" product, for example: volume/10+price_in[cp_flag=="C"]-price_out[cp_flag=="P"] .

An additional layer of difficulty here is that there may be a different number of "P"s and "C"s for each date/volume/dtm combination (for example see volume = 1200), which I want to treat as described below.

As an output I look for

         date volume dtm
1: 1999-01-04   1000  10
2: 1999-01-04   1200  12
3: 1999-01-04   1200  13
4: 1999-01-04   1250  12
5: 1999-01-04   1250  15
6: 1999-01-04   1300  15

with an additional column representing the result of the function above and where the length of the table is determined by all date/volume/dtm in the following manner:

  1. if there is exactly 1 "P" product and 1 "C" product (as in the first combination), then the computation of the formula is straightforward and 1 result is obtained
  2. if there are several "P" products and 1 "C" product, or the other way around, then the result column has an entry for all cross-combinations between the "P" and "C" products
  3. same if there are several "P" products and several "C" products, ie all possible cross combinations between "C"s and "P"s are taken
  4. if there is only one product/category (eg C), the function is performed with the price_in[cp_flag="C"] and price_out[cp_flag="C"] value of the same products, and the length of the result is the same as in the original table

I believe that this can be done efficiently via the data.table methods, but I don't quite get it to work. It seems natural to operate over .SD s. So I first tried to expand each subset via

df[,print(.SD),by=.(date,volume,dtm),.SDcols=c("price_in","price_out","volume","cp_flag")]

This gives me all the wanted combinations:

   price_in price_out cp_flag
1:     10.1      12.4       P
2:    120.4     122.1       C
   price_in price_out cp_flag
1:    100.3     102.0       C
2:      7.4       8.1       P
   price_in price_out cp_flag
1:     90.2      99.1       C
   price_in price_out cp_flag
1:      0.1       0.6       P
   price_in price_out cp_flag
1:     99.1       100       P
   price_in price_out cp_flag
1:     45.7      48.1       C

But now I am not sure how to compute the customary function, ie check how many "C"s and "P"s there are in each group and then computing above's formula, ie volume/10+price_in[cp_flag=="C"]-price_out[cp_flag=="P"] for all C's and P's. But in case there are only Cs or Ps use only their information, ie same product for price_in and price_out. For the first part, I have tried something like

df[,lapply(.SD,function(x) x[cp_flag=="C",volume/10]+x[cp_flag=="C",price_in]-x[cp_flag=="P",price_out]),by=.(date,volume,dtm),.SDcols=c("price_in","price_out","volume","cp_flag")]

but this fails as I apparently seem to misunderstand how to use a custom function in this case.

Question : How do I properly use such a custom function on a subset of a data table with such additional case structure?

Note: I know this example looks quite complicated, and maybe I am too deep already and might have spent too much time cracking it, but I can't see an easier way to present my issue. If there is any further clearance I can give, please let me know. Any help is highly appreciated!

I guess this is it:

res = df[, {
  flags   = sort(unique(cp_flag))
  n_flags = length(flags)

  if (n_flags == 1L) 
    .(g = .GRP, price_in, price_out, flags = flags) 

  else CJ(
    g = .GRP,
    price_in = price_in[cp_flag == "C"], 
    price_out = price_out[cp_flag == "P"], 
    flags = toString(flags)
  )
}, by=.(date, volume, dtm)][, v := volume/10 + price_in - price_out][]

         date volume dtm g price_in price_out flags     v
1: 1999-01-04   1000  10 1    120.4      12.4  C, P 208.0
2: 1999-01-04   1200  12 2     90.2       8.1  C, P 202.1
3: 1999-01-04   1200  12 2    100.3       8.1  C, P 212.2
4: 1999-01-04   1250  12 3      0.1       0.6     P 124.5
5: 1999-01-04   1250  15 4     99.1     100.0     P 124.1
6: 1999-01-04   1300  15 5     45.7      48.1     C 127.6

I wouldn't say this is efficient, but at least the calculations are done in a vectorized way.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM