Avoiding a Slow Loop in R - dplyr Solution?

I've got a problem which I can solve with a slow and clumsy loop in R. However, I'm hoping there's a more elegant (and faster) solution...

The simplest explanation I can think of: each row of data describes an action on a switch. The rows are sorted by switch ID (switch 1, switch 2, etc.) and by the chronological order of the actions. Each switch can be either on or off at any point in time. The action can be "turn on", "turn off" or "leave alone". For each row I want to know the status of the switch (on or off) both before and after the action described by that row.

Each switch starts in the "off" position.

(The data I'm working with actually relates to insurance policy data, but this switch-based analogy works and is probably simpler to understand.)

A reproducible example:

df <- data.frame(switch_id = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3),
                  counter = c(1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4),
                  action = c("on", "off", "on", "off", "on", "same", "same", "same", "on", "same", "same", "same", "off", "off", "off", "on", "off", "same", "on"))

I can get to where I want to be using a not-particularly-elegant loop:

df$status_before <- NA
df$status_after <- NA

for (i in seq_len(nrow(df))) {

  if (df$counter[i] == 1) {
    df$status_before[i] <- FALSE # switch always starts in the "off" position
  } else {
    df$status_before[i] <- df$status_after[i - 1]
  }

  if (df$action[i] == "on") {
    df$status_after[i] <- TRUE
  } else if (df$action[i] == "off") {
    df$status_after[i] <- FALSE
  } else { # "same"
    df$status_after[i] <- df$status_before[i] # leave everything alone
  }

}

...but in R, loops like this are usually best avoided because they can run very slowly. It doesn't matter in this tiny data set of course, but the real data I'm working with has ~1M rows, so it could be a problem.

Is there a "vectorised" solution to this, perhaps using dplyr type commands?

Thank you.

As far as I understand from looking at your loop, you want status_before to be TRUE / FALSE depending on the action of the previous counter, and status_after to be TRUE / FALSE depending on the action of the current counter. Did I get that right? I'm not quite sure what you want with the same actions, though...

To look at values from previous rows, you can use the lag() function from dplyr (and to look "ahead", use lead() instead). This code gives the same output as your loop does:

EDITED:

library(dplyr)
library(tidyr) # fill() comes from tidyr

# copy "action" into a helper column in which "same" is replaced by the last
# explicit action of that switch (the original action column is left untouched)
df <- df %>%
  group_by(switch_id) %>%
  mutate(last_action = ifelse(action == "same", NA, as.character(action))) %>% # mark "same" as NA
  fill(last_action) # fills downwards within each switch_id

# do the actual evaluation
df <- df %>%
  group_by(switch_id) %>%
  mutate(status_after = case_when(last_action == "on" ~ TRUE,
                                  last_action == "off" ~ FALSE,
                                  TRUE ~ FALSE), # leading "same": switch is still off
         status_before = lag(status_after, default = FALSE)) %>% # every switch starts "off"
  ungroup() %>%
  select(switch_id, counter, action, status_before, status_after)

This should be correct now!
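To see what lag() and lead() from the answer above do in isolation, here is a minimal sketch on a single made-up vector (the values and the variable names are illustrative only, not taken from the question's data). It assumes dplyr is installed.

```r
library(dplyr)

# status of one switch after each of four consecutive actions (made-up values)
status_after <- c(TRUE, FALSE, TRUE, TRUE)

# lag() moves every value one position later; the first slot has no
# predecessor and is NA unless `default` is supplied
lagged <- lag(status_after)
lagged_off <- lag(status_after, default = FALSE)

# lead() moves every value one position earlier; the last slot becomes NA
led <- lead(status_after)

print(lagged)      # NA  TRUE FALSE  TRUE
print(lagged_off)  # FALSE  TRUE FALSE  TRUE
print(led)         # FALSE  TRUE  TRUE    NA
```

Supplying default = FALSE is what encodes "each switch starts in the off position". Inside a grouped mutate() the shift restarts for every group, so the first row of each switch_id gets the default. Note that base R also has a stats::lag() with different behaviour; loading dplyr (or writing dplyr::lag()) makes sure the intended one is used.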

Here is a data.table solution:

Edit: Need to operate by switch_id; as of data.table v1.12.4, there is a native way to fill in missing values (nafill), which is used in this edit; added some comments.

library(data.table)
df <- data.table(switch_id = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2),
    counter = c(1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7),
    action = c("on", "off", "on", "off", "on", "same", "same", "same", "on", "same", "same", "same", "off", "off", "off"))

# in "status_after", set "on" to TRUE, "off" to FALSE and "same" to NA
df[, status_after := ifelse(action == "same", NA, action == "on")]

# fill in NA using last observation carried forward, by switch_id
# (nafill() works on numeric input, hence the + to convert the logical to integer)
df[, status_after := as.logical(nafill(+(status_after), type = "locf")), by = switch_id]

# a switch whose first action is "same" has nothing to carry forward: it is still off
df[is.na(status_after), status_after := FALSE]

# status_before: shift status_after (default: lag one), by switch_id;
# the first instance per switch_id is filled with FALSE
df[, status_before := shift(status_after, fill = FALSE), by = switch_id]

# reorder columns
setcolorder(df, c("switch_id", "counter", "action", "status_before", "status_after"))
df
#>     switch_id counter action status_before status_after
#>  1:         1       1     on         FALSE         TRUE
#>  2:         1       2    off          TRUE        FALSE
#>  3:         1       3     on         FALSE         TRUE
#>  4:         1       4    off          TRUE        FALSE
#>  5:         1       5     on         FALSE         TRUE
#>  6:         1       6   same          TRUE         TRUE
#>  7:         1       7   same          TRUE         TRUE
#>  8:         1       8   same          TRUE         TRUE
#>  9:         2       1     on         FALSE         TRUE
#> 10:         2       2   same          TRUE         TRUE
#> 11:         2       3   same          TRUE         TRUE
#> 12:         2       4   same          TRUE         TRUE
#> 13:         2       5    off          TRUE        FALSE
#> 14:         2       6    off         FALSE        FALSE
#> 15:         2       7    off         FALSE        FALSE

Created on 2020-03-12 by the reprex package (v0.3.0)
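As a small illustration of the "last observation carried forward" step used in the data.table answer, the sketch below runs nafill() on one made-up integer vector (1 = on, 0 = off, NA = "same"). The vector and variable names are assumptions for illustration; data.table v1.12.4 or later is assumed to be installed.

```r
library(data.table)

# 1 = on, 0 = off, NA = "same" (no explicit action on that row); made-up values
status <- c(NA, 1L, NA, NA, 0L, NA)

# carry the last non-missing value forward
filled <- nafill(status, type = "locf")
print(filled)  # NA  1  1  1  0  0

# the leading NA has no earlier observation to carry forward, so it stays NA;
# under the rule "each switch starts off" it can be set to 0 explicitly
filled_off <- filled
filled_off[is.na(filled_off)] <- 0L
print(as.logical(filled_off))  # FALSE  TRUE  TRUE  TRUE FALSE FALSE
```

The leading NA only arises when a switch's first recorded action is "same", which does not happen in the example data, but it may be worth handling for real data.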
