[英]How to summarize data by-group, by creating dummy variables as the collapsing method
I'm trying to summarize a dataset by groups, to have dummy columns for whether each group's values appear among the data's ungrouped most frequent values.我正在尝试按组汇总数据集,以便为每个组的值是否出现在数据未分组的最频繁值中设置虚拟列。
As an example, let's take flights
data from nycflights13
.例如,让我们从
nycflights13
获取flights
数据。
library(dplyr, warn.conflicts = FALSE)
library(nycflights13)
my_flights_raw <-
flights %>%
select(carrier, month, dest)
my_flights_raw
#> # A tibble: 336,776 x 3
#> carrier month dest
#> <chr> <int> <chr>
#> 1 UA 1 IAH
#> 2 UA 1 IAH
#> 3 AA 1 MIA
#> 4 B6 1 BQN
#> 5 DL 1 ATL
#> 6 UA 1 ORD
#> 7 B6 1 FLL
#> 8 EV 1 IAD
#> 9 B6 1 MCO
#> 10 AA 1 ORD
#> # ... with 336,766 more rows
My end-goal: I'm interested to know about each carrier
in each month
: whether it flew to the most popular destinations.我的最终目标:我有兴趣了解每个
carrier
month
:它是否飞往最受欢迎的目的地。 I define "most popular" by the top-5 most frequent dest
values in each month, then intersecting all months' top-5s.我通过每个月前 5 名最频繁的
dest
值定义“最受欢迎” ,然后与所有月份的前 5 名相交。
step 1第1步
I start by simple aggregation by months:我首先按月进行简单聚合:
my_flights_agg <-
my_flights_raw %>%
count(month, dest, name = "n_obs") %>%
arrange(month, desc(n_obs))
my_flights_agg
#> # A tibble: 1,113 x 3
#> month dest n_obs
#> <int> <chr> <int>
#> 1 1 ATL 1396
#> 2 1 ORD 1269
#> 3 1 BOS 1245
#> 4 1 MCO 1175
#> 5 1 FLL 1161
#> 6 1 LAX 1159
#> 7 1 CLT 1058
#> 8 1 MIA 981
#> 9 1 SFO 889
#> 10 1 DCA 865
#> # ... with 1,103 more rows
step 2第2步
And now I'm going to cut the data to keep only the top 5 most popular per month.现在我将削减数据,只保留每月最受欢迎的前 5 名。
my_flights_top_5_by_month <-
my_flights_agg %>%
group_by(month) %>%
slice_max(order_by = n_obs, n = 5)
my_flights_top_5_by_month
#> # A tibble: 60 x 3
#> # Groups: month [12]
#> month dest n_obs
#> <int> <chr> <int>
#> 1 1 ATL 1396
#> 2 1 ORD 1269
#> 3 1 BOS 1245
#> 4 1 MCO 1175
#> 5 1 FLL 1161
#> 6 2 ATL 1267
#> 7 2 ORD 1197
#> 8 2 BOS 1182
#> 9 2 MCO 1110
#> 10 2 FLL 1073
#> # ... with 50 more rows
step 3第 3 步
Now simply get the unique()
of my_flights_top_5_by_month$dest
:现在只需获取
my_flights_top_5_by_month$dest
的unique()
:
my_flights_top_dest_across_months <- unique(my_flights_top_5_by_month$dest)
## [1] "ATL" "ORD" "BOS" "MCO" "FLL" "LAX" "SFO" "CLT"
Here's my question: given my_flights_top_dest_across_months
, how can I summarize my_flights_raw
to distinct carrier
& month
, such that the collapsing principle is whether each combination of carrier
& month
had flawn to each of the dest
values in my_flights_top_dest_across_months
?这是我的问题:鉴于
my_flights_top_dest_across_months
,我如何将my_flights_raw
总结为不同的carrier
& month
,这样折叠原则是carrier
& month
每个组合是否对 my_flights_top_dest_across_months 中的每个dest
值my_flights_top_dest_across_months
?
desired output期望的输出
## carrier month ATL ORD BOS MCO FLL LAX SFO CLT
## <chr> <int> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
## 1 9E 1 TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
## 2 9E 2 TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
## 3 9E 3 TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
## 4 9E 4 FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
## 5 9E 5 TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
## 6 9E 6 FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
## 7 9E 7 FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
## 8 9E 8 FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
## 9 9E 9 FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
## 10 9E 10 FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
## # ... with 175 more rows
I currently have the following code that is simply inefficient.我目前有以下代码效率低下。 It works fine for the example
flights
data, but is taking forever when applied on a large dataset (with several millions rows and groups).它适用于示例
flights
数据,但在应用于大型数据集(具有数百万行和组)时需要永远。 Any idea how the task described above can be done more efficiently?知道如何更有效地完成上述任务吗?
# too slow :(
op_slow_output <-
my_flights_raw %>%
group_by(carrier, month) %>%
summarise(destinations_vec = list(unique(dest))) %>%
add_column(top_dest = list(my_flights_top_dest_across_month)) %>%
mutate(are_top_dest_included = purrr::map2(.x = destinations_vec, .y = top_dest, .f = ~ .y %in% .x ), .keep = "unused") %>%
mutate(across(are_top_dest_included, ~purrr::map(.x = ., .f = ~setNames(object = .x, nm = my_flights_top_dest_across_month)) )) %>%
tidyr::unnest_wider(are_top_dest_included)
It is quite possible that using the data.table
library will be faster here.很有可能在这里使用
data.table
库会更快。 I will not argue.我不会争论。 But I have mastered
dplyr
and would like to offer a pretty cool solution using the functions from this particular library.但是我已经掌握了
dplyr
并且想使用这个特定库中的函数提供一个非常酷的解决方案。
First, let's prepare two little auxiliary functions.首先我们准备两个小辅助函数。 We will see how they work later.
我们稍后会看到它们是如何工作的。
library(nycflights13)
library(tidyverse)
ftopDest = function(data, ntop){
data %>%
group_by(dest) %>%
summarise(ndest = n()) %>%
arrange(desc(ndest)) %>%
pull(dest) %>% .[1:ntop]
}
carrierToTopDest = function(data, topDest){
data %>% mutate(carrierToToDest = dest %in% topDest)
}
Now you only need one simple mutation !现在你只需要一个简单的突变!
df = flights %>% nest_by(year, month) %>% #Step 1
mutate(topDest = list(ftopDest(data, 5)), #Step 2
data = list(carrierToTopDest(data, topDest))) #Step 3
But let me describe step by step what is happening here.但让我逐步描述这里发生的事情。
In step one, let's nest the data into an internal tibble
named data
.在第一步中,让我们将数据嵌套到名为
data
的内部tibble
。
Output after Step 1步骤 1 后的输出
# A tibble: 12 x 3
# Rowwise: year, month
year month data
<int> <int> <list<tibble[,17]>>
1 2013 1 [27,004 x 17]
2 2013 2 [24,951 x 17]
3 2013 3 [28,834 x 17]
4 2013 4 [28,330 x 17]
5 2013 5 [28,796 x 17]
6 2013 6 [28,243 x 17]
7 2013 7 [29,425 x 17]
8 2013 8 [29,327 x 17]
9 2013 9 [27,574 x 17]
10 2013 10 [28,889 x 17]
11 2013 11 [27,268 x 17]
12 2013 12 [28,135 x 17]
In step 2, we add the most popular flight destinations.在第 2 步中,我们添加最受欢迎的航班目的地。
Output after step 2步骤 2 后的输出
# A tibble: 12 x 4
# Rowwise: year, month
year month data topDest
<int> <int> <list<tibble[,17]>> <list>
1 2013 1 [27,004 x 17] <chr [5]>
2 2013 2 [24,951 x 17] <chr [5]>
3 2013 3 [28,834 x 17] <chr [5]>
4 2013 4 [28,330 x 17] <chr [5]>
5 2013 5 [28,796 x 17] <chr [5]>
6 2013 6 [28,243 x 17] <chr [5]>
7 2013 7 [29,425 x 17] <chr [5]>
8 2013 8 [29,327 x 17] <chr [5]>
9 2013 9 [27,574 x 17] <chr [5]>
10 2013 10 [28,889 x 17] <chr [5]>
11 2013 11 [27,268 x 17] <chr [5]>
12 2013 12 [28,135 x 17] <chr [5]>
In the last step, we add the carrierToToDest
variable to the data
variable, which determines whether the flight was going to one of the ntop
places from the given month.在最后一步中,我们将
carrierToToDest
变量添加到data
变量中,该变量确定航班是否前往给定月份的ntop
地点之一。
Output after step 3第 3 步后的输出
# A tibble: 12 x 4
# Rowwise: year, month
year month data topDest
<int> <int> <list> <list>
1 2013 1 <tibble [27,004 x 18]> <chr [5]>
2 2013 2 <tibble [24,951 x 18]> <chr [5]>
3 2013 3 <tibble [28,834 x 18]> <chr [5]>
4 2013 4 <tibble [28,330 x 18]> <chr [5]>
5 2013 5 <tibble [28,796 x 18]> <chr [5]>
6 2013 6 <tibble [28,243 x 18]> <chr [5]>
7 2013 7 <tibble [29,425 x 18]> <chr [5]>
8 2013 8 <tibble [29,327 x 18]> <chr [5]>
9 2013 9 <tibble [27,574 x 18]> <chr [5]>
10 2013 10 <tibble [28,889 x 18]> <chr [5]>
11 2013 11 <tibble [27,268 x 18]> <chr [5]>
12 2013 12 <tibble [28,135 x 18]> <chr [5]>
How now we can see the most popular places.现在我们如何才能看到最受欢迎的地方。 Let's do this:
我们开工吧:
df %>% mutate(topDest = paste(topDest, collapse = " "))
output输出
# A tibble: 12 x 4
# Rowwise: year, month
year month data topDest
<int> <int> <list> <chr>
1 2013 1 <tibble [27,004 x 18]> ATL ORD BOS MCO FLL
2 2013 2 <tibble [24,951 x 18]> ATL ORD BOS MCO FLL
3 2013 3 <tibble [28,834 x 18]> ATL ORD BOS MCO FLL
4 2013 4 <tibble [28,330 x 18]> ATL ORD LAX BOS MCO
5 2013 5 <tibble [28,796 x 18]> ORD ATL LAX BOS SFO
6 2013 6 <tibble [28,243 x 18]> ORD ATL LAX BOS SFO
7 2013 7 <tibble [29,425 x 18]> ORD ATL LAX BOS CLT
8 2013 8 <tibble [29,327 x 18]> ORD ATL LAX BOS SFO
9 2013 9 <tibble [27,574 x 18]> ORD LAX ATL BOS CLT
10 2013 10 <tibble [28,889 x 18]> ORD ATL LAX BOS CLT
11 2013 11 <tibble [27,268 x 18]> ATL ORD LAX BOS CLT
12 2013 12 <tibble [28,135 x 18]> ATL LAX MCO ORD CLT
Can we see flights to these destinations?我们能看到飞往这些目的地的航班吗? Of course, it's not difficult.
当然,这并不难。
df %>% select(-topDest) %>%
unnest(data) %>%
filter(carrierToToDest) %>%
select(year, month, flight, carrier, dest)
Output输出
# A tibble: 80,941 x 5
# Groups: year, month [12]
year month flight carrier dest
<int> <int> <int> <chr> <chr>
1 2013 1 461 DL ATL
2 2013 1 1696 UA ORD
3 2013 1 507 B6 FLL
4 2013 1 79 B6 MCO
5 2013 1 301 AA ORD
6 2013 1 1806 B6 BOS
7 2013 1 371 B6 FLL
8 2013 1 4650 MQ ATL
9 2013 1 1743 DL ATL
10 2013 1 3768 MQ ORD
# ... with 80,931 more rows
This is my recipe.这是我的食谱。 Very simple and transparent in my opinion.
在我看来非常简单和透明。 I would be extremely obligated if you would try it on your data and let me know with efficiency.
如果您能在您的数据上尝试并高效地告诉我,我将非常有义务。
I just noticed that I wanted to group not only after year
(although you don't mention it, it must be so), month
, but also by the carrier
variable.我只是注意到我不仅要在
year
之后分组(虽然你没有提到它,但必须如此), month
,而且还要按carrier
变量分组。 So let's add it as another grouping variable.因此,让我们将其添加为另一个分组变量。
df = flights %>% nest_by(year, month, carrier) %>%
mutate(topDest = list(ftopDest(data, 5)),
data = list(carrierToTopDest(data, topDest)))
output输出
# A tibble: 185 x 5
# Rowwise: year, month, carrier
year month carrier data topDest
<int> <int> <chr> <list> <list>
1 2013 1 9E <tibble [1,573 x 17]> <chr [5]>
2 2013 1 AA <tibble [2,794 x 17]> <chr [5]>
3 2013 1 AS <tibble [62 x 17]> <chr [5]>
4 2013 1 B6 <tibble [4,427 x 17]> <chr [5]>
5 2013 1 DL <tibble [3,690 x 17]> <chr [5]>
6 2013 1 EV <tibble [4,171 x 17]> <chr [5]>
7 2013 1 F9 <tibble [59 x 17]> <chr [5]>
8 2013 1 FL <tibble [328 x 17]> <chr [5]>
9 2013 1 HA <tibble [31 x 17]> <chr [5]>
10 2013 1 MQ <tibble [2,271 x 17]> <chr [5]>
# ... with 175 more rows
Now let's get acquainted with the new top 5 directions.现在让我们熟悉新的前 5 个方向。
df %>% mutate(topDest = paste(topDest, collapse = " "))
output输出
# A tibble: 185 x 5
# Rowwise: year, month, carrier
year month carrier data topDest
<int> <int> <chr> <list> <chr>
1 2013 1 9E <tibble [1,573 x 17]> BOS PHL CVG MSP ORD
2 2013 1 AA <tibble [2,794 x 17]> DFW MIA ORD LAX BOS
3 2013 1 AS <tibble [62 x 17]> SEA NA NA NA NA
4 2013 1 B6 <tibble [4,427 x 17]> FLL MCO BOS PBI SJU
5 2013 1 DL <tibble [3,690 x 17]> ATL DTW MCO FLL MIA
6 2013 1 EV <tibble [4,171 x 17]> IAD DTW DCA RDU CVG
7 2013 1 F9 <tibble [59 x 17]> DEN NA NA NA NA
8 2013 1 FL <tibble [328 x 17]> ATL CAK MKE NA NA
9 2013 1 HA <tibble [31 x 17]> HNL NA NA NA NA
10 2013 1 MQ <tibble [2,271 x 17]> RDU CMH ORD BNA ATL
Given my_flights_top_5_by_month
and my_flights_raw
, we can try the following data.table
approach鉴于
my_flights_top_5_by_month
和my_flights_raw
,我们可以尝试以下data.table
方法
library(data.table)
dcast(
setDT(my_flights_top_5_by_month)[
setDT(my_flights_raw),
on = .(month, dest)
][, n_obs := !is.na(n_obs)],
carrier + month ~ dest,
fun.aggregate = function(x) sum(x) > 0,
value.var = "n_obs"
)
which gives这使
carrier month ABQ ACK ALB ANC ATL AUS AVL BDL BGR BHM
1: 9E 1 FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
2: 9E 2 FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
3: 9E 3 FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
4: 9E 4 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
5: 9E 5 FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
---
181: YV 8 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
182: YV 9 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
183: YV 10 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
184: YV 11 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
185: YV 12 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
BNA BOS BQN BTV BUF BUR BWI BZN CAE CAK CHO CHS
1: FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
2: FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
3: FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
4: FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
5: FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
---
181: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
182: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
183: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
184: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
185: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
CLE CLT CMH CRW CVG DAY DCA DEN DFW DSM DTW EGE
1: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
2: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
3: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
4: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
5: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
---
181: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
182: FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
183: FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
184: FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
185: FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
EYW FLL GRR GSO GSP HDN HNL HOU IAD IAH ILM IND
1: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
2: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
3: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
4: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
5: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
---
181: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
182: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
183: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
184: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
185: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
JAC JAX LAS LAX LEX LGA LGB MCI MCO MDW MEM MHT
1: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
2: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
3: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
4: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
5: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
---
181: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
182: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
183: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
184: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
185: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
MIA MKE MSN MSP MSY MTJ MVY MYR OAK OKC OMA ORD
1: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
2: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
3: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
4: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
5: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
---
181: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
182: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
183: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
184: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
185: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
ORF PBI PDX PHL PHX PIT PSE PSP PVD PWM RDU RIC
1: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
2: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
3: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
4: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
5: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
---
181: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
182: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
183: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
184: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
185: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
ROC RSW SAN SAT SAV SBN SDF SEA SFO SJC SJU SLC
1: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
2: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
3: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
4: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
5: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
---
181: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
182: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
183: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
184: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
185: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
SMF SNA SRQ STL STT SYR TPA TUL TVC TYS XNA
1: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
2: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
3: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
4: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
5: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
---
181: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
182: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
183: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
184: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
185: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
I took a stub myself, using functions from the collapse
package.我自己拿了一个存根,使用了
collapse
包中的函数。
library(magrittr)
library(collapse)
library(data.table)
my_flights_raw %>%
collapse::funique() %>%
collapse::fgroup_by(carrier, month) %>%
collapse::fsummarise(nested_dest = list(dest)) %>%
collapse::ftransform(new_col = lapply(nested_dest, \(x) my_flights_top_dest_across_months %in% x)) %>%
collapse::fcompute(., data.table::transpose(new_col), keep = 1:2) %>%
setNames(c("carrier", "month", my_flights_top_dest_across_months)) %>%
collapse::qTBL()
Unsurprisingly, collapse
gives the fastest execution time.不出所料,
collapse
提供了最快的执行时间。 But I was surprised to see that @ThomasIsCoding's solution based on data.table
was slower than my original tidyverse
mix-and-match solution.但我很惊讶地看到,@ ThomasIsCoding的解决方案基于
data.table
较慢比我原来的tidyverse
混合和匹配的解决方案。
I also factored in the single data.table
dependency in Thomas's answer, compared to the variety of dependencies in my original method.与我原始方法中的各种依赖项相比,我还在 Thomas 的答案中考虑了单个
data.table
依赖项。
library(nycflights13)
library(dplyr, warn.conflicts = FALSE)
# OP original
my_flights_raw <-
flights %>%
select(carrier, month, dest)
my_flights_agg <-
my_flights_raw %>%
count(month, dest, name = "n_obs") %>%
arrange(month, desc(n_obs))
my_flights_top_dest_across_months <-
my_flights_agg %>%
group_by(month) %>%
slice_max(order_by = n_obs, n = 5) %>%
pull(dest) %>%
unique()
my_flights_top_5_by_month <-
my_flights_agg %>%
group_by(month) %>%
slice_max(order_by = n_obs, n = 5)
my_flights_top_dest_across_month <- unique(my_flights_top_5_by_month$dest)
op_slow <- function() {
library(tidyr)
library(tibble)
library(purrr)
my_flights_raw %>%
group_by(carrier, month) %>%
summarise(destinations_vec = list(unique(dest))) %>%
add_column(top_dest = list(my_flights_top_dest_across_month)) %>%
mutate(are_top_dest_included = purrr::map2(.x = destinations_vec, .y = top_dest, .f = ~ .y %in% .x ), .keep = "unused") %>%
mutate(across(are_top_dest_included, ~purrr::map(.x = ., .f = ~setNames(object = .x, nm = my_flights_top_dest_across_month)) )) %>%
tidyr::unnest_wider(are_top_dest_included)
}
# OP collapse
op_collapse <- function() {
library(magrittr)
library(collapse)
library(data.table)
my_flights_raw %>%
collapse::funique() %>%
collapse::fgroup_by(carrier, month) %>%
collapse::fsummarise(nested_dest = list(dest)) %>%
collapse::ftransform(new_col = lapply(nested_dest, \(x) my_flights_top_dest_across_months %in% x)) %>%
collapse::fcompute(., data.table::transpose(new_col), keep = 1:2) %>%
setNames(c("carrier", "month", my_flights_top_dest_across_months)) %>%
collapse::qTBL()
}
# Thomas data.table
thomas_data.table <- function() {
library(data.table)
my_flights_top_dest_across_months <-
data.table(
dest = unique(my_flights_top_5_by_month$dest),
fd = 1
)
dcast(my_flights_top_dest_across_months[
setDT(my_flights_raw),
on = .(dest)
],
carrier + month ~ dest,
fun.aggregate = function(x) sum(x) > 0
)[, c(
"carrier", "month",
my_flights_top_dest_across_months$dest
), with = FALSE]
}
output_op_slow <- op_slow()
output_op_collapse <- op_collapse()
output_thomas <- thomas_data.table()
#> Using 'month' as value column. Use 'value.var' to override
waldo::compare(output_op_slow, output_op_collapse, ignore_attr = TRUE)
#> v No differences
waldo::compare(output_op_slow, as_tibble(output_thomas), ignore_attr = TRUE)
#> v No differences
bm <- bench::mark(op_slow = op_slow(),
op_collapse = op_collapse(),
thomas_dt = thomas_data.table(),
check = FALSE,
iterations = 100)
ggplot2::autoplot(bm)
Does this do what you want?这是你想要的吗? As far as I can tell it matches your output but has more rows because it includes all months for all carriers;
据我所知,它与您的输出匹配,但有更多行,因为它包括所有运营商的所有月份;
carrier
"OO" only has flights in 5 months and your version only shows those 5 months in the summary. carrier
“OO”只有 5 个月内的航班,您的版本仅在摘要中显示这 5 个月。
With the data as provided (336k rows), this takes a similar amount of time as your function, but it's faster as you deal with larger data.使用提供的数据(336k 行),这与您的函数花费的时间相似,但处理更大的数据时速度更快。 When I run these on data 100x as big after setting
my_flights_raw <- my_flights_raw %>% tidyr::uncount(100)
, to make it 33M rows, the code below is about 40% faster.当我在设置
my_flights_raw <- my_flights_raw %>% tidyr::uncount(100)
后在 100 倍大的数据上运行这些时,使其成为 33M 行,下面的代码快了大约 40%。
Given the large number of groups you're dealing with, I expect this is a situation where data.table
will really shine with better performance.考虑到您要处理的大量组,我希望在这种情况下
data.table
将真正data.table
更好的性能。
library(tidyverse)
my_flights_raw %>%
count(carrier, month, dest) %>%
complete(carrier, month, dest) %>%
filter(dest %in% my_flights_top_dest_across_months) %>%
mutate(n = if_else(!is.na(n), TRUE, FALSE)) %>%
pivot_wider(names_from = dest, values_from = n)
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.