mutate or filter according to conditions from tibble
I'm currently working through R for Data Science, specifically exercise 5.7.1 #8, which analyzes the data in the nycflights13 package (loaded with library(nycflights13)).
The question reads:
- For each plane, count the number of flights before the first delay of greater than 1 hour.
My attempt was to create a table that finds the first "over 60 minute" delay by using the first() function:
first_del <- flights %>%
  select(month, day, flight, dep_time, tailnum, dep_delay) %>%
  filter(dep_delay > 60) %>%
  group_by(month, day) %>%
  arrange(month, day, dep_time) %>%
  summarise(flight = first(flight), first_time = first(dep_time))
first_del
# A tibble: 365 x 4
# Groups: month [?]
month day flight first_time
<int> <int> <int> <int>
1 1 1 4576 811
2 1 2 22 126
3 1 3 104 50
4 1 4 608 106
5 1 5 11 37
6 1 6 27 746
7 1 7 145 756
8 1 8 4334 740
9 1 9 51 641
10 1 10 905 743
# ... with 355 more rows
My idea is to tag each row in the flights tibble with 1 if it matches the month and day and is less than the flight number of the first delayed flight on that day (for example, from the first_del tibble above, flight 4576 is the first "over 60 minute delayed" flight on Jan 1, and every other flight before it will count).
The desired output would be something like:
flights %>%
  filter(dep_time > 805) %>%
  select(month, day, flight, dep_time, tag)
# A tibble: 272,933 x 4
month day flight dep_time tag
<int> <int> <int> <int> <int>
1 1 1 269 807 1
2 1 1 4388 809 1
3 1 1 3538 810 1
4 1 1 2395 810 1
5 1 1 4260 811 1
6 1 1 4576 811 1
7 1 1 675 811 0
8 1 1 4537 812 0
9 1 1 914 813 0
10 1 1 346 814 0
Ideally, it would be great to tally all rows less than or equal to the flight number on each day according to the first_del tibble. I've tried many combinations of filter, %in%, and mutate, but have not yet been successful. Should I create a custom function?
My ultimate desired output is (with fictitious $count values):
first_del
# A tibble: 365 x 4
# Groups: month [?]
month day flight first_time count
<int> <int> <int> <int> <int>
1 1 1 4576 811 212
2 1 2 22 126 216
3 1 3 104 50 298
4 1 4 608 106 220
5 1 5 11 37 168
6 1 6 27 746 287
7 1 7 145 756 302
8 1 8 4334 740 246
9 1 9 51 641 235
10 1 10 905 743 313
where $count is the number of flights that preceded the first delayed flight on that day (as wanted by the question in the links above).
You can use which.max on a logical vector to determine the first instance satisfying a condition. You also need to check that the condition actually occurs.
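library(dplyr)
library(nycflights13)

flights %>%
  mutate(dep_delay = coalesce(dep_delay, 0)) %>%  # treat an NA delay as 0
  arrange(month, day, dep_time) %>%               # order flights chronologically
  group_by(tailnum) %>%
  summarise(max_delay = max(dep_delay),
            # position of the first flight delayed by more than 60 minutes
            which_first_geq_1hr = which.max(dep_delay > 60)) %>%
  ungroup() %>%
  filter(max_delay > 60)                          # keep only planes that were actually delayed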
I'm assuming that delay means departure delay and that an NA delay means 0, or at least less than an hour, and I'm ignoring planes that 'failed' to be delayed by more than an hour. The coalesce is necessary to avoid which.max(NA).
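As a small illustration of why both the existence check and the NA handling matter, here is a base R sketch on made-up vectors (delays and never are hypothetical, not taken from flights):

```r
# Made-up delay vectors for illustration only
delays <- c(10, 45, 75, 20, 90)

# Index of the first element greater than 60
first_big <- which.max(delays > 60)
first_big        # 3
first_big - 1    # 2 elements come before the first big delay

# Pitfall 1: if the condition never holds, which.max() still returns 1,
# because FALSE is then the maximum and position 1 is its first occurrence.
never <- c(5, 10, 15)
which.max(never > 60)   # 1
any(never > 60)         # FALSE, so the 1 above must not be trusted

# Pitfall 2: an all-NA input gives a zero-length result
which.max(NA)           # integer(0)
```

The any() check plays the same role here as the filter(max_delay > 60) step in the pipeline above.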
The question is per plane, so you really want to operate grouped by tailnum. You can add a flag column, but really you need to end up with something you can pass to filter (a logical vector) or slice (a vector of row indices). There are various ways to do this, e.g. slice(seq(c(which(dep_delay > 60) - 1, n())[1])), but a nice approach is to use dplyr's cumall (a cumulative version of all, like cumsum is to sum) to generate a logical vector for filter:
library(dplyr)

nycflights13::flights %>%
  group_by(tailnum) %>%
  arrange(year, month, day, dep_time) %>%  # ensure order before cumall
  filter(cumall(pmax(dep_delay, arr_delay) < 60)) %>%
  tally()  # count number of observations per group (tailnum)
#> # A tibble: 3,709 x 2
#> # Groups: tailnum [3,709]
#> tailnum n
#> <chr> <int>
#> 1 N10156 9
#> 2 N102UW 25
#> 3 N103US 46
#> 4 N104UW 3
#> 5 N105UW 22
#> 6 N107US 20
#> 7 N108UW 36
#> 8 N109UW 28
#> 9 N110UW 15
#> 10 N11107 7
#> # ... with 3,699 more rows
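To see what cumall does on its own, here is a minimal sketch on a hypothetical toy tibble (toy and before_first are made-up names). It assumes only dep_delay matters and uses <= 60 so that a delay of exactly one hour does not count as "greater than 1 hour"; both are assumptions that differ slightly from the pipeline above.

```r
library(dplyr)

# cumall() stays TRUE only until the first FALSE
cumall(c(TRUE, TRUE, FALSE, TRUE))
#> [1]  TRUE  TRUE FALSE FALSE

# Hypothetical toy data: two planes, rows already in time order
toy <- tibble(
  tailnum   = c("A", "A", "A", "A", "B", "B", "B"),
  dep_delay = c(5, 30, 90, 10, 0, 12, 3)
)

before_first <- toy %>%
  group_by(tailnum) %>%
  filter(cumall(dep_delay <= 60)) %>%  # keep rows before the first delay > 60
  tally()                              # rows left per plane

before_first$n   # 2 for plane A, 3 for plane B
```

Plane B is never delayed, so all of its flights are counted. A plane whose very first flight is delayed would lose every row in the filter and drop out of the result instead of showing a count of 0.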
It's possible to make an intermediate table of first big delays and do a self-join to ID where they are, or add a flag value to some observations with if_else, but regardless, subsetting up to those rows will still require similar logic to the cumall above, so they're really just longer, slower approaches.
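For comparison, the flag-column variant might look like the sketch below, again on a hypothetical toy tibble (toy, flagged, counts and n_before are made-up names, and the <= 60 threshold on dep_delay alone is an assumption). It still leans on cumall to build the tag:

```r
library(dplyr)

# Hypothetical toy data: plane C is delayed on its very first flight
toy <- tibble(
  tailnum   = c("A", "A", "A", "A", "C", "C"),
  dep_delay = c(5, 30, 90, 10, 120, 4)
)

# Tag each row 1 if it comes before the plane's first delay > 60, else 0
flagged <- toy %>%
  group_by(tailnum) %>%
  mutate(tag = if_else(cumall(dep_delay <= 60), 1L, 0L)) %>%
  ungroup()

# Summing the tag gives the count per plane
counts <- flagged %>%
  group_by(tailnum) %>%
  summarise(n_before = sum(tag))

counts$n_before   # 2 for plane A, 0 for plane C
```

One practical difference from filtering is that plane C stays in the result with a count of 0 rather than disappearing.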