mutate or filter according to conditions from tibble

I'm currently working through R for Data Science, specifically exercise 5.7.1 #8, which analyzes the data in the nycflights13 package (library(nycflights13)).

The question reads:

  1. For each plane, count the number of flights before the first delay of greater than 1 hour.

My attempt was to create a table that finds the first "over 60 minute" delay on each day by using the first() function:

first_del <- flights %>%
  select(month, day, flight, dep_time, tailnum, dep_delay) %>%
  filter(dep_delay > 60) %>%
  group_by(month,day) %>%
  arrange(month, day, dep_time) %>%
  summarise(flight = first(flight), first_time = first(dep_time))

first_del

# A tibble: 365 x 4
# Groups:   month [?]
    month   day flight first_time
    <int> <int>  <int>      <int>
 1     1     1   4576        811
 2     1     2     22        126
 3     1     3    104         50
 4     1     4    608        106
 5     1     5     11         37
 6     1     6     27        746
 7     1     7    145        756
 8     1     8   4334        740
 9     1     9     51        641
10     1    10    905        743
# ... with 355 more rows

My idea is to tag each row in the flights tibble with 1 if it matches the month and day and comes no later than the first delayed flight on that day (for example, from the first_del tibble above, flight 4576 is the first "over 60 minute delayed" flight on Jan 1, and every flight before it will count). The desired output would be something like:

  flights %>%
  filter(dep_time > 805) %>%
  select(month, day, flight, dep_time, tag)

# A tibble: 272,933 x 5
   month   day flight dep_time   tag
   <int> <int>  <int>    <int>  <int>
 1     1     1    269      807    1
 2     1     1   4388      809    1
 3     1     1   3538      810    1
 4     1     1   2395      810    1
 5     1     1   4260      811    1
 6     1     1   4576      811    1
 7     1     1    675      811    0
 8     1     1   4537      812    0
 9     1     1    914      813    0
10     1     1    346      814    0

Ideally, I would tally all rows up to and including the first delayed flight on each day, according to the first_del tibble. I've tried many combinations of filter, %in%, and mutate, but have not yet been successful. Should I create a custom function?

My ultimate desired output is (with fictitious $count values):

 first_del

# A tibble: 365 x 5
# Groups:   month [?]
    month   day flight first_time  count
    <int> <int>  <int>      <int>  <int>
 1     1     1   4576        811    212
 2     1     2     22        126    216
 3     1     3    104         50    298
 4     1     4    608        106    220
 5     1     5     11         37    168
 6     1     6     27        746    287
 7     1     7    145        756    302
 8     1     8   4334        740    246
 9     1     9     51        641    235
10     1    10    905        743    313

where $count is the number of flights that preceded the first delayed flight on that day (as the exercise quoted above asks).

You can use which.max on a logical vector to find the first element that satisfies a condition. You also need to check that the condition actually occurs, because which.max returns 1 when no element is TRUE.

library(dplyr)
library(nycflights13)

flights %>%
  mutate(dep_delay = coalesce(dep_delay, 0)) %>%
  arrange(month, day, dep_time) %>%
  group_by(tailnum) %>%
  summarise(max_delay = max(dep_delay), 
            which_first_geq_1hr = which.max(dep_delay > 60)) %>%
  ungroup %>%
  filter(max_delay > 60)

I'm assuming that delay means departure delay and that an NA delay means 0, or at least less than an hour, and I'm ignoring planes that 'failed' to be delayed by more than an hour. The coalesce is necessary to avoid which.max(NA). The number of flights before the first such delay is then which_first_geq_1hr - 1.
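To see why the extra check matters, here is a minimal base R sketch on made-up delay vectors. The vectors and the helper flights_before() are hypothetical illustrations, not part of nycflights13 or dplyr, and the sketch assumes NA delays have already been replaced (for example with coalesce as above).

```r
# Toy departure delays (minutes) for two hypothetical planes.
delays_a <- c(5, 12, 75, 3, 90)   # first delay > 60 is the 3rd flight
delays_b <- c(5, 12, 30)          # never delayed by more than 60 minutes

first_a <- which.max(delays_a > 60)   # index of the first TRUE
first_b <- which.max(delays_b > 60)   # all FALSE, yet which.max still returns 1

# Hypothetical helper: guard with any() so that "no qualifying delay"
# is not mistaken for "the first flight was delayed".
flights_before <- function(delay) {
  if (!any(delay > 60)) return(NA_integer_)
  which.max(delay > 60) - 1L
}

flights_before(delays_a)
flights_before(delays_b)
```

Returning NA for planes that were never delayed is only one possible convention; the answer above drops those planes instead by filtering on max_delay > 60.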

The question is per plane, so you really want to operate grouped by tailnum. You can add a flag column, but really you need to end up with something you can pass to filter (a logical vector) or slice (a vector of row indices). There are various ways to do this, e.g. slice(seq_len(c(which(dep_delay > 60) - 1, n())[1])), but a nice approach is to use dplyr's cumall (a cumulative version of all, like cumsum is to sum) to generate a logical vector for filter:

library(dplyr)

nycflights13::flights %>% 
    group_by(tailnum) %>% 
    arrange(year, month, day, dep_time) %>%    # ensure order before cumall
    filter(cumall(pmax(dep_delay, arr_delay) < 60)) %>% 
    tally()    # count number of observations per group (tailnum)
#> # A tibble: 3,709 x 2
#> # Groups:   tailnum [3,709]
#>    tailnum     n
#>    <chr>   <int>
#>  1 N10156      9
#>  2 N102UW     25
#>  3 N103US     46
#>  4 N104UW      3
#>  5 N105UW     22
#>  6 N107US     20
#>  7 N108UW     36
#>  8 N109UW     28
#>  9 N110UW     15
#> 10 N11107      7
#> # ... with 3,699 more rows

It's possible to make an intermediate table of first big delays and do a self-join to identify where they are, or to add a flag value to some observations with if_else, but regardless, subsetting up to those rows will still require logic similar to the cumall above, so they're really just longer, slower approaches.
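As a small illustration of how cumall behaves, here is a sketch on a hypothetical toy tibble (the tail numbers and delays are invented). It assumes the rows are already in chronological order within each plane and that there are no NA delays, and it uses <= 60 so that only delays strictly greater than 60 minutes end the run.

```r
library(dplyr)

# Hypothetical toy data: two planes, flights already in chronological order.
toy <- tibble(
  tailnum   = c("N1", "N1", "N1", "N1", "N2", "N2", "N2"),
  dep_delay = c(  10,   20,   90,    5,   70,    0,    0)
)

# cumall() stays TRUE only until the first FALSE is seen.
cumall(c(TRUE, TRUE, FALSE, TRUE))   # TRUE TRUE FALSE FALSE

# Summing the cumall() vector counts the flights before the first long delay.
counts <- toy %>%
  group_by(tailnum) %>%
  summarise(n_before = sum(cumall(dep_delay <= 60)))

counts
```

Summing inside summarise, rather than filtering and then calling tally, keeps planes whose very first flight was delayed (N2 here, with a count of 0); with filter those planes lose all their rows and disappear from the result. Which behaviour is wanted depends on how you read the exercise.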
