[英]speed up R apply on data frame
我有一個賽馬數據集。 對於每條賽馬記錄,如果賽道值不缺失,我想統計近兩年在相同場地、賽道、相似距離的情況下獲勝的馬匹數量。 我使用apply來循環每一行,我只想在原始數據集中添加一個新的列數,因此output應該多一列,並且與給定的輸入相同的行長。 但是速度非常慢。 如何加快循環?
rdate:賽馬年-月-日。 場地:ST,HV。 軌道:草坪,全天候軌道。 距離:1200、1400、1600、1800 等 ind_win:0(馬未獲得第一名),1(馬獲得第一名)。
structure(list(rdate = structure(c(17450, 17475, 17481, 17496,
17510, 17517, 17532, 17566, 17593, 17615, 17629, 17657, 17667,
17796, 17817, 17839, 17856, 17860, 17881, 17881, 17902, 17902
), class = "Date"), venue = c("HV", "ST", "ST", "ST", "ST", "ST",
"ST", "ST", "ST", "ST", "ST", "ST", "HV", "ST", "ST", "ST", "HV",
"ST", "ST", "ST", "ST", "ST"), track = c("TURF", "TURF", "TURF",
"TURF", "TURF", "TURF", "TURF", "TURF", "TURF", "TURF", "TURF",
"TURF", "TURF", "TURF", "TURF", "TURF", "TURF", "TURF", "TURF",
"TURF", "TURF", "TURF"), horsenum = c("A366", "A366", "A366",
"A366", "A366", "A366", "A366", "A366", "A366", "A366", "A366",
"A366", "A366", "B440", "B440", "B440", "A366", "B440", "A366",
"B440", "A366", "B440"), distance = c(1800L, 1800L, 1600L, 1600L,
1800L, 1600L, 1800L, 1800L, 1800L, 1600L, 1800L, 2000L, 1800L,
1200L, 1400L, 1400L, 1650L, 1400L, 1600L, 1400L, 1800L, 1400L
), ind_win = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L)), row.names = c(NA, -22L
), class = "data.frame")
library(tidyverse)
library(lubridate)
HWinCountF <- function(df){
if (!is.na(df["track"])) {
tmp <- subset(jc.data, horsenum == df["horsenum"] & rdate < df["rdate"] & rdate > ymd(df["rdate"]) - years(2) &
venue == df["venue"] & track==df["track"] & distance>=as.integer(df["distance"])-200 &
distance<=as.integer(df["distance"])+200)
if (nrow(tmp) > 0) {
return(nrow(tmp[tmp$ind_win == 1,]))
} else {
return(NA)
}
} else {
return(NA)
}
}
jc.data['h_win_count'] <- apply(jc.data, 1, HWinCountF)
根據OP的需要
我想統計過去兩年在相同場地、賽道、相似距離的情況下獲勝的馬匹數
由於這是直接聚合,請避免循環並考慮與數據框的merge
subset
,因為您似乎需要相互比較觀察結果。 然后為馬獲勝運行aggregate
。 下面運行發布的數據樣本。
# MERGE BY COMMON VARIABLES AND SUBSET RESULTS BY DATE AND DISTANCE
compare_df <- subset(merge(jc.data, jc.data, by=c("horsenum", "venue", "track")),
rdate.x < rdate.y &
rdate.x > lubridate::ymd(rdate.y) - lubridate::years(2) &
distance.x >= as.integer(distance.y) - 200 &
distance.x <= as.integer(distance.y) + 200
)
# SUM ind_win GROUPED BY COMMON VARIABLES
agg_df <- aggregate(cbind(h_win_count = ind_win.x) ~ horsenum + venue + track,
data = compare_df, FUN=sum)
agg_df
# horsenum venue track h_win_count
# 1 A366 HV TURF 0
# 2 A366 ST TURF 0
# 3 B440 ST TURF 2
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.