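Applying a custom function on data.table instead of using plyr and ddply
I am working with a data.table called orderFlow and compute potentialWelfare.tmp as the output. So far the plyr-based approach below has been my solution, but because the input orderFlow has several million rows, I would prefer a solution that takes advantage of the performance of data.table in R.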
# solution so far, poor performance on huge orderFlow input data.table
require(plyr)
potentialWelfare.tmp = ddply(orderFlow,
                             .variables = c("simulationrun_id", "db"),
                             .fun = calcPotentialWelfare,
                             .progress = "text",
                             .parallel = TRUE)
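EDIT1: In a nutshell, the custom function counts the asks (NbAsks) in df, orders the bids by valuation in descending order, and sums the valuations of the top NbAsks bids (or of all bids, if there are fewer bids than asks). This selects the most valuable bids and adds up their valuations. The code is legacy and probably not very efficient, but it has worked in combination with plyr and ordinary data.frames.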
calcPotentialWelfare <- function(df){
  NbAsks = dim(df[df$type=="ask",])[1]
  # print(NbAsks)
  Bids = df[df$type == "bid",]
  # dd[with(dd, order(-z, b)), ]
  Bids = Bids[with(Bids,order(valuation,decreasing = TRUE)),]
  NbBids = dim(df[df$type == "bid",])[1]
  # print(Bids)
  if (NbAsks > 0){
    Bids = Bids[1:min(NbAsks,NbBids),]
    potentialWelfare = sum(Bids$valuation)
    return(potentialWelfare)
  }
  else{
    potentialWelfare = 0
    return(potentialWelfare)
  }
}
Unfortunately, I could not find a working way to implement this with data.table. What I have come up with so far, using ?data.table and the corresponding FAQ, is:
# trying to use data.table, but it doesn't work so far.
potentialWelfare.tmp = orderFlow[, lapply(.SD, calcPotentialWelfare), by = list(simulationrun_id, db),.SDcols=c("simulationrun_id", "db")]
What I get is:
Error in `[.data.frame`(orderFlow, , lapply(.SD, calcPotentialWelfare), : unused arguments (by = list(simulationrun_id, db), .SDcols = c("simulationrun_id", "db"))
This is the input:
> head(orderFlow)
type valuation price dateCreation dateDue dateMatched id
1 ask 0.30000000 0.3 2012-01-01 00:00:00.000000 2012-01-01 00:30:00.000000 2012-01-01 00:01:01.098307 1
2 bid 0.39687633 0.0 2012-01-01 00:01:01.098307 2012-01-01 00:10:40.024807 2012-01-01 00:01:01.098307 2
3 bid 0.96803384 NA 2012-01-01 00:03:05.660811 2012-01-01 00:06:26.368941 <NA> 3
4 bid 0.06163186 NA 2012-01-01 00:05:25.413959 2012-01-01 00:09:06.189893 <NA> 4
5 bid 0.57017143 NA 2012-01-01 00:10:10.344876 2012-01-01 00:57:58.998516 <NA> 5
6 bid 0.37188442 NA 2012-01-01 00:11:25.761372 2012-01-01 00:43:24.274176 <NA> 6
created_at updated_at simulationrun_id db
1 2013-12-10 14:37:29.065634 NA 7004 1
2 2013-12-10 14:37:29.065674 NA 7004 1
3 2013-12-10 14:37:29.065701 NA 7004 1
4 2013-12-10 14:37:29.065726 NA 7004 1
5 2013-12-10 14:37:29.065750 NA 7004 1
6 2013-12-10 14:37:29.065775 NA 7004 1
I am expecting an output like this, i.e. the function calcPotentialWelfare aggregates data from the column 'valuation' of the data.table orderFlow in a specific way.
> head(potentialWelfare.tmp)
simulationrun_id db potentialWelfare
1 1 1 16.86684
2 2 1 18.44314
3 4 1 16.86684
4 5 1 18.44314
5 7 1 16.86684
6 8 1 18.44314
I would be really happy to see this solved. Thanks for reading!
EDIT2:
> dput(head(orderFlow))
structure(list(type = c("ask", "bid", "bid", "bid", "bid", "bid"
), valuation = c(0.3, 0.39687632952068, 0.968033835246625, 0.0616318564942726,
0.570171430446081, 0.371884415116724), price = c(0.3, 0, NA,
NA, NA, NA), dateCreation = c("2012-01-01 00:00:00.000000", "2012-01-01 00:01:01.098307",
"2012-01-01 00:03:05.660811", "2012-01-01 00:05:25.413959", "2012-01-01 00:10:10.344876",
"2012-01-01 00:11:25.761372"), dateDue = c("2012-01-01 00:30:00.000000",
"2012-01-01 00:10:40.024807", "2012-01-01 00:06:26.368941", "2012-01-01 00:09:06.189893",
"2012-01-01 00:57:58.998516", "2012-01-01 00:43:24.274176"),
dateMatched = c("2012-01-01 00:01:01.098307", "2012-01-01 00:01:01.098307",
NA, NA, NA, NA), id = 1:6, created_at = c("2013-12-10 14:37:29.065634",
"2013-12-10 14:37:29.065674", "2013-12-10 14:37:29.065701",
"2013-12-10 14:37:29.065726", "2013-12-10 14:37:29.065750",
"2013-12-10 14:37:29.065775"), updated_at = c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_), simulationrun_id = c(7004L,
7004L, 7004L, 7004L, 7004L, 7004L), db = c(1L, 1L, 1L, 1L,
1L, 1L)), .Names = c("type", "valuation", "price", "dateCreation",
"dateDue", "dateMatched", "id", "created_at", "updated_at", "simulationrun_id",
"db"), row.names = c(NA, 6L), class = "data.frame")
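I think this should be faster. There are a few mistakes in the way you are using data.table. I suggest you read the introduction carefully, work through the examples, and read the FAQ.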
calcPotentialWelfare <- function(dt){
  NbAsks = nrow(dt["ask", nomatch=0L]) # binary search based subset/join - very fast
  Bids = dt["bid", nomatch=0L]         # binary search based subset/join - very fast
  NbBids = nrow(Bids)
  # for each 'type', the 'valuation' will always be sorted,
  # but in ascending order - but you need descending order
  # so you can just use the function 'tail' to fetch the last 'n' items... as follows.
  if (NbAsks > 0) return(sum(tail(Bids, min(NbAsks, NbBids))$valuation))
  else return(0)
}
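library(data.table)
# the '[.data.frame' error and the dput output in the question show that
# orderFlow is a plain data.frame, so convert it to a data.table by reference first
setDT(orderFlow)
# setkey on 'type' column to use binary search based subset/join in the function
# also on valuation so that we don't have to 'order' for every group
# inside the function - we can use 'tail'
setkey(orderFlow, type, valuation)
potentialWelfare.tmp =
  orderFlow[, calcPotentialWelfare(.SD),
            by=.(simulationrun_id, db),
            .SDcols=c("type", "valuation")]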
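As an alternative to the helper function, the same per-group logic can be written directly in j with plain vector operations. The following is only a minimal sketch under stated assumptions, not the approach from the answer above: the table `orders` and its values are made up for illustration, and `n_asks`, `bids`, and `welfare` are hypothetical names.

```r
library(data.table)

# Hypothetical toy data: run 1 has 2 asks and 3 bids, run 2 has bids only.
orders <- data.table(
  simulationrun_id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L),
  db               = 1L,
  type             = c("ask", "ask", "bid", "bid", "bid", "bid", "bid"),
  valuation        = c(0.25, 0.5, 1, 0.125, 0.5, 0.75, 2)
)

welfare <- orders[, {
  n_asks <- sum(type == "ask")
  # Note: sort() drops NA valuations, unlike order() in the original function.
  bids <- sort(valuation[type == "bid"], decreasing = TRUE)
  # head() returns all bids when there are fewer bids than asks.
  list(potentialWelfare = if (n_asks > 0L) sum(head(bids, n_asks)) else 0)
}, by = .(simulationrun_id, db)]

print(welfare)
```

For the toy data, run 1 sums its two highest bids (1 and 0.5), and run 2 has no asks, so its result is 0. Whether this is faster than the keyed version on the real data would have to be measured.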
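.SD is a special variable that holds, for each group, a data.table containing all columns not mentioned in by= (when .SDcols is not specified). If .SDcols is specified, .SD contains only those columns, restricted to the rows that belong to the group.
Using lapply(.SD, ...) applies the function to each column separately, which is not what you need here. You need to pass the whole group's data to the function. However, since the function only needs the 'type' and 'valuation' columns, you can speed things up by supplying .SDcols=c('type', 'valuation'). Ignoring the other columns saves a lot of time.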
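The difference between the two calling styles can be seen on a small table. This is an illustrative sketch only: the table `dt`, its columns `g`, `x`, `y`, and the result names are made up for the example.

```r
library(data.table)

# Hypothetical toy data: two groups, two value columns.
dt <- data.table(g = c("a", "a", "b"), x = c(1, 2, 3), y = c(10, 20, 30))

# lapply(.SD, f): f is called once per column of .SD,
# so each call only sees a single vector (x, then y).
per_column <- dt[, lapply(.SD, sum), by = g, .SDcols = c("x", "y")]

# Using .SD as a whole: the expression is evaluated once per group
# and can see all selected columns of that group together.
per_group <- dt[, list(total = sum(.SD$x) + sum(.SD$y)), by = g, .SDcols = c("x", "y")]

print(per_column)
print(per_group)
```

In the question's attempt, .SDcols also listed the grouping columns themselves, so even a per-column call would never have seen 'type' or 'valuation'.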