Data wrangling, keep rows based on time threshold

I have a data frame with a time variable (in seconds) and a site variable, but the issue only concerns the time. I want to label each row as "good" or "bad" based on a specific time threshold.

Hypothetically, let's assume that my threshold is time_threshold >= 250.

Here's an example dataset:


data.frame(site = c("site1",
                    "site2",
                    "site2",
                    "site2",
                    "site2",
                    "site2",
                    "site3",
                    "site3",
                    "site1",
                    "site3",
                    "site3"),
           time_difference = c(250,
                               300,
                               277,
                               137,
                               75,
                               85,
                               108,
                               91,
                               0,
                               118,
                               113))

Ultimately I want to have something like this, where each row is assigned either a "good" or "bad":


data.frame(site = c("site1",
                    "site2",
                    "site2",
                    "site2",
                    "site2",
                    "site2",
                    "site3",
                    "site3",
                    "site1",
                    "site3",
                    "site3"),
           time_difference = c(250,
                               300,
                               277,
                               137,
                               75,
                               85,
                               108,
                               91,
                               0,
                               118,
                               113),
           status = c("good",
                      "good",
                      "good",
                      "good",
                      "bad",
                      "good",
                      "bad",
                      "good",
                      "bad",
                      "bad",
                      "good"))

The way each row is assigned a status is based on time streaks. I'll try to explain: starting from the first row, time_difference is 250, which is equal to our threshold, so "good" is assigned in the status column. The next two rows are also "good", as they are above the threshold.

Once we get to row four, the time difference is 137, so we need to cumulatively add the following rows until our threshold is reached. In this case 137+75+85 = 297. Once this is established, the first row of the streak (137) is given "good" and the last row (85) becomes the start of the new streak, whilst the row with 75 is given "bad" (anything between the starting row and the start of the next streak is given "bad").

This process continues until the end of the dataset (i.e. 85+108+91 = 284: keep 85 and 91 and give "bad" to 108; 91+0+118+113 = 322: keep 91 and 113 and give "bad" to 0 and 118).
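As a quick sanity check on the streaks described above, here is a small sketch in base R. It only re-types the time_difference column from the example; the `streaks` list (first row and last row of each streak) is my own reading of the worked example, not something computed.

```r
time_difference <- c(250, 300, 277, 137, 75, 85, 108, 91, 0, 118, 113)
threshold <- 250

# Each streak as c(first row, last row), read off the worked example by hand
streaks <- list(c(4, 6), c(6, 8), c(8, 11))

# Total time over each streak, including its last row
streak_totals <- vapply(streaks,
                        function(s) sum(time_difference[s[1]:s[2]]),
                        numeric(1))

# Total time over each streak, excluding its last row
before_last <- vapply(streaks,
                      function(s) sum(time_difference[s[1]:(s[2] - 1)]),
                      numeric(1))

all(streak_totals >= threshold)  # TRUE: every streak reaches the threshold
all(before_last < threshold)     # TRUE: and only once its last row is added
```

So in each streak the threshold is crossed exactly on the last row, which is why that row is kept and then starts the next streak.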

I hope this is relatively clear. Basically, I want to keep the first row of each 250 streak and make the last row of the streak the next starting row.

I think this loop captures your logic.

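# df is the example data frame from the question.
# Start with every row marked 'bad'; rows that open or close a streak become 'good'.
df$status   <- 'bad'
running_sum <- 0   # time accumulated in the current streak
mem         <- 1   # row where accumulation starts

for(i in seq_len(nrow(df))) {

  # Nothing accumulated yet: the first non-zero row opens the streak
  if(running_sum == 0 && df$time_difference[i] != 0) mem <- i

  running_sum <- running_sum + df$time_difference[i]

  if(running_sum >= 250) {
    # Row i closes the streak and becomes the start of the next one
    running_sum    <- df$time_difference[i]
    df$status[mem] <- 'good'
    df$status[i]   <- 'good'
  }
}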

df
#>     site time_difference status
#> 1  site1             250   good
#> 2  site2             300   good
#> 3  site2             277   good
#> 4  site2             137   good
#> 5  site2              75    bad
#> 6  site2              85   good
#> 7  site3             108    bad
#> 8  site3              91   good
#> 9  site1               0    bad
#> 10 site3             118    bad
#> 11 site3             113   good

Created on 2022-03-17 by the reprex package (v2.0.1)
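If you need this for more than one column or threshold, the same loop can be wrapped in a function. This is only a sketch: `flag_streaks` and its `threshold` argument are names I made up for illustration, and it assumes the input is a numeric vector with no NA values and a positive threshold.

```r
# Hypothetical helper (the name is illustrative): same loop as above, on a plain vector
flag_streaks <- function(x, threshold = 250) {
  status      <- rep("bad", length(x))
  running_sum <- 0   # time accumulated in the current streak
  mem         <- 1   # row where accumulation starts

  for (i in seq_along(x)) {
    # Nothing accumulated yet: the first non-zero row opens the streak
    if (running_sum == 0 && x[i] != 0) mem <- i

    running_sum <- running_sum + x[i]

    if (running_sum >= threshold) {
      # Row i closes the streak and seeds the next one
      running_sum <- x[i]
      status[mem] <- "good"
      status[i]   <- "good"
    }
  }
  status
}

time_difference <- c(250, 300, 277, 137, 75, 85, 108, 91, 0, 118, 113)
flag_streaks(time_difference)
#>  [1] "good" "good" "good" "good" "bad"  "good" "bad"  "good" "bad"  "bad"  "good"

# A streak that starts below the threshold: rows 1 and 3 bracket the first streak
flag_streaks(c(100, 100, 100, 50, 300))
#> [1] "good" "bad"  "good" "bad"  "good"
```

With the data frame from the question you would then write df$status <- flag_streaks(df$time_difference).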
