简体   繁体   English

加速 R 应用于数据帧

[英]speed up R apply on data frame

I have a horse racing dataset.我有一个赛马数据集。 For each horse racing record, if the track value is not missing, I want to count the number of horse win in the past two years with the same venue, track, and similar distance.对于每条赛马记录,如果赛道值不缺失,我想统计近两年在相同场地、赛道、相似距离的情况下获胜的马匹数量。 I use apply to loop each row, I only want to add a new column count previous win to the original data set, so the output should have one more column and same row length as the input given.我使用apply来循环每一行,我只想在原始数据集中添加一个新的列数,因此output应该多一列,并且与给定的输入相同的行长。 But the speed is very slow.但是速度非常慢。 How can I speed up the loop?如何加快循环?

rdate: horse racing year-month-date. rdate:赛马年-月-日。 venue: ST, HV.场地:ST,HV。 track: TURF, All WEATHER TRACK.轨道:草坪,全天候轨道。 distance: 1200, 1400, 1600, 1800, etc. ind_win: 0(horse did not win 1st place), 1(horse win 1st place).距离:1200、1400、1600、1800 等 ind_win:0(马未获得第一名),1(马获得第一名)。

structure(list(rdate = structure(c(17450, 17475, 17481, 17496, 
17510, 17517, 17532, 17566, 17593, 17615, 17629, 17657, 17667, 
17796, 17817, 17839, 17856, 17860, 17881, 17881, 17902, 17902
), class = "Date"), venue = c("HV", "ST", "ST", "ST", "ST", "ST", 
"ST", "ST", "ST", "ST", "ST", "ST", "HV", "ST", "ST", "ST", "HV", 
"ST", "ST", "ST", "ST", "ST"), track = c("TURF", "TURF", "TURF", 
"TURF", "TURF", "TURF", "TURF", "TURF", "TURF", "TURF", "TURF", 
"TURF", "TURF", "TURF", "TURF", "TURF", "TURF", "TURF", "TURF", 
"TURF", "TURF", "TURF"), horsenum = c("A366", "A366", "A366", 
"A366", "A366", "A366", "A366", "A366", "A366", "A366", "A366", 
"A366", "A366", "B440", "B440", "B440", "A366", "B440", "A366", 
"B440", "A366", "B440"), distance = c(1800L, 1800L, 1600L, 1600L, 
1800L, 1600L, 1800L, 1800L, 1800L, 1600L, 1800L, 2000L, 1800L, 
1200L, 1400L, 1400L, 1650L, 1400L, 1600L, 1400L, 1800L, 1400L
), ind_win = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L)), row.names = c(NA, -22L
), class = "data.frame")

library(tidyverse)
library(lubridate)

HWinCountF <- function(df){
    if (!is.na(df["track"])) {
      tmp <- subset(jc.data, horsenum == df["horsenum"] & rdate < df["rdate"] & rdate > ymd(df["rdate"]) - years(2) &
                      venue == df["venue"] & track==df["track"] &  distance>=as.integer(df["distance"])-200 &
                      distance<=as.integer(df["distance"])+200)
      if (nrow(tmp) > 0) {
        return(nrow(tmp[tmp$ind_win == 1,]))
      } else {
        return(NA)
      }
    } else {
      return(NA)
    }
  }

  jc.data['h_win_count'] <- apply(jc.data, 1, HWinCountF)

Per OP's needs根据OP的需要

I want to count the number of horse win in the past two years with the same venue, track, and similar distance我想统计过去两年在相同场地、赛道、相似距离的情况下获胜的马匹数

As this is straighforward aggregation, avoid looping and consider merge with subset of data frame on itself since you appear to need to compare observations with each other.由于这是直接聚合,请避免循环并考虑与数据框的merge subset ,因为您似乎需要相互比较观察结果。 Then run aggregate for horse wins.然后为马获胜运行aggregate Below runs with posted data sample.下面运行发布的数据样本。

# MERGE BY COMMON VARIABLES AND SUBSET RESULTS BY DATE AND DISTANCE
compare_df <- subset(merge(jc.data, jc.data, by=c("horsenum", "venue", "track")),
                     rdate.x < rdate.y &
                     rdate.x > lubridate::ymd(rdate.y) - lubridate::years(2) &
                     distance.x >= as.integer(distance.y) - 200 &
                     distance.x <= as.integer(distance.y) + 200 
              )

# SUM ind_win GROUPED BY COMMON VARIABLES
agg_df <- aggregate(cbind(h_win_count = ind_win.x) ~ horsenum + venue + track, 
                    data = compare_df, FUN=sum)

agg_df
#   horsenum venue track h_win_count
# 1     A366    HV  TURF           0
# 2     A366    ST  TURF           0
# 3     B440    ST  TURF           2

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM