Mutate or filter based on conditions from a tibble

Question

I am currently working through R for Data Science, specifically exercise 5.7.1 #8, which analyzes the data from the library(nycflights13) package.

The exercise reads:

For each plane, count the number of flights before the first delay of greater than 1 hour.

My attempt was to create a table that uses the first() function to find the first delay of more than 60 minutes on each day:

first_del <- flights %>%
  select(month, day, flight, dep_time, tailnum, dep_delay) %>%
  filter(dep_delay > 60) %>%
  group_by(month,day) %>%
  arrange(month, day, dep_time) %>%
  summarise(flight = first(flight), first_time = first(dep_time))

first_del

# A tibble: 365 x 4
# Groups:   month [?]
    month   day flight first_time
    <int> <int>  <int>      <int>
 1     1     1   4576        811
 2     1     2     22        126
 3     1     3    104         50
 4     1     4    608        106
 5     1     5     11         37
 6     1     6     27        746
 7     1     7    145        756
 8     1     8   4334        740
 9     1     9     51        641
10     1    10    905        743
# ... with 355 more rows

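My idea is to tag each row of the flights tibble with 1 if it matches the month and day and comes no later than the first delayed flight of that day, and with 0 otherwise. For example, from first_del above, flight 4576 is the first flight on January 1 that was delayed by more than 60 minutes, so every flight before it would count. The desired output looks like this: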

  flights %>%
  filter(dep_time > 805) %>%
  select(month, day, flight, dep_time, tag)

# A tibble: 272,933 x 4
   month   day flight dep_time   tag
   <int> <int>  <int>    <int>  <int>
 1     1     1    269      807    1
 2     1     1   4388      809    1
 3     1     1   3538      810    1
 4     1     1   2395      810    1
 5     1     1   4260      811    1
 6     1     1   4576      811    1
 7     1     1    675      811    0
 8     1     1   4537      812    0
 9     1     1    914      813    0
10     1     1    346      814    0

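Ideally, I would count all rows up to and including the first delayed flight of each day, as given by first_del. I have tried many combinations of filter, %in%, and mutate, but have not succeeded yet. Should I write a custom function?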

The output I ultimately want is (with dummy $count values):

 first_del

# A tibble: 365 x 4
# Groups:   month [?]
    month   day flight first_time  count
    <int> <int>  <int>      <int>  <int>
 1     1     1   4576        811    212
 2     1     2     22        126    216
 3     1     3    104         50    298
 4     1     4    608        106    220
 5     1     5     11         37    168
 6     1     6     27        746    287
 7     1     7    145        756    302
 8     1     8   4334        740    246
 9     1     9     51        641    235
10     1    10    905        743    313

where $count is the number of flights before the first delayed flight of that day (as asked in the exercise quoted above).

Answer 1

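You can use which.max on a logical vector to find the first instance where a condition holds. You also need to check that the condition actually occurs, because which.max returns 1 when every element is FALSE.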

library(dplyr)
library(nycflights13)

flights %>%
  mutate(dep_delay = coalesce(dep_delay, 0)) %>%
  arrange(month, day, dep_time) %>%
  group_by(tailnum) %>%
  summarise(max_delay = max(dep_delay), 
            which_first_geq_1hr = which.max(dep_delay > 60)) %>%
  ungroup %>%
  filter(max_delay > 60)

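I am assuming that "delay" means departure delay and that an NA delay means 0, or at least less than an hour, and I am ignoring planes that never had a delay of more than an hour. The coalesce is necessary to avoid which.max(NA). Note that which_first_geq_1hr is the position of the first long delay, so the number of flights before it is that value minus 1.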
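To make the which.max behaviour concrete, here is a minimal sketch in base R. The vectors `delays` and `never` are made-up toy data, not taken from nycflights13.

```r
# Toy departure delays (minutes) for one hypothetical plane, in flight order
delays <- c(5, 30, 75, 10, 90)

# which.max on a logical vector returns the position of the first TRUE
first_idx <- which.max(delays > 60)
flights_before <- first_idx - 1   # flights flown before the first long delay

# Pitfall: with no TRUE at all, which.max still returns 1,
# so check separately that a long delay actually occurred
never <- c(5, 30, 10)
misleading_idx <- which.max(never > 60)
has_long_delay <- any(never > 60)
```

This is why the answer's pipeline also computes max_delay and filters on it: without that check, planes with no long delay would look as if their very first flight had been delayed.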

Answer 2

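The exercise asks about each plane, so you really want to group by tailnum. You could add a flag column, but in the end you need something you can pass to filter (a logical vector) or slice (a vector of row indices). There are various ways to do this, e.g. slice(seq_len(c(which(dep_delay > 60) - 1, n())[1])), but a nice approach is to use dplyr's cumall (a cumulative version of all, just as cumsum is a cumulative version of sum) to generate the logical vector for filter: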

library(dplyr)

nycflights13::flights %>% 
    group_by(tailnum) %>% 
    arrange(year, month, day, dep_time) %>%    # ensure order before cumall
    filter(cumall(pmax(dep_delay, arr_delay) < 60)) %>% 
    tally()    # count number of observations per group (tailnum)
#> # A tibble: 3,709 x 2
#> # Groups:   tailnum [3,709]
#>    tailnum     n
#>    <chr>   <int>
#>  1 N10156      9
#>  2 N102UW     25
#>  3 N103US     46
#>  4 N104UW      3
#>  5 N105UW     22
#>  6 N107US     20
#>  7 N108UW     36
#>  8 N109UW     28
#>  9 N110UW     15
#> 10 N11107      7
#> # ... with 3,699 more rows

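It would be possible to build an intermediate table of the first long delays and do a self-join to identify where they are, or to add a flag value to some observations with if_else. Either way, subsetting up to those rows still requires logic similar to the cumall above, so those are really just longer, slower approaches.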
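As a small illustration of how cumall cuts each group off at its first long delay, here is a sketch on invented data. The tibble `toy` and its tail numbers "A", "B" and "C" are hypothetical, and the condition uses dep_delay alone with a threshold of `<= 60`, which is one reading of "more than 1 hour" rather than the exact condition used above.

```r
library(dplyr)

# Hypothetical flights, already in chronological order within each plane
toy <- tibble(
  tailnum   = c("A", "A", "A", "A", "B", "B", "C"),
  dep_delay = c(5, 20, 90, 10, 0, 15, 120)
)

before_first <- toy %>%
  group_by(tailnum) %>%
  # cumall() stays TRUE only until the first row that fails the condition
  filter(cumall(dep_delay <= 60)) %>%
  ungroup() %>%
  count(tailnum)   # rows kept per plane = flights before the first long delay

before_first
```

Two edge cases are worth noticing in this sketch: plane "B" is never delayed by more than an hour, so all of its flights are counted, and plane "C" is delayed on its first flight, so it has no rows left and does not appear in the result at all.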
