How to summarize data by-group, by creating dummy variables as the collapsing method

I'm trying to summarize a dataset by group, creating dummy columns that indicate whether each group's values appear among the most frequent values in the ungrouped data.

As an example, let's take the flights data from nycflights13.

library(dplyr, warn.conflicts = FALSE)
library(nycflights13)

my_flights_raw <-
  flights %>%
  select(carrier, month, dest)

my_flights_raw
#> # A tibble: 336,776 x 3
#>    carrier month dest 
#>    <chr>   <int> <chr>
#>  1 UA          1 IAH  
#>  2 UA          1 IAH  
#>  3 AA          1 MIA  
#>  4 B6          1 BQN  
#>  5 DL          1 ATL  
#>  6 UA          1 ORD  
#>  7 B6          1 FLL  
#>  8 EV          1 IAD  
#>  9 B6          1 MCO  
#> 10 AA          1 ORD  
#> # ... with 336,766 more rows

My end goal: I'm interested to know, for each carrier in each month, whether it flew to the most popular destinations. I define "most popular" as the top-5 most frequent dest values in each month, then pool all months' top-5s (taking their union).

Step 1
I start with a simple aggregation by month:

my_flights_agg <- 
  my_flights_raw %>%
  count(month, dest, name = "n_obs") %>%
  arrange(month, desc(n_obs)) 

my_flights_agg
#> # A tibble: 1,113 x 3
#>    month dest  n_obs
#>    <int> <chr> <int>
#>  1     1 ATL    1396
#>  2     1 ORD    1269
#>  3     1 BOS    1245
#>  4     1 MCO    1175
#>  5     1 FLL    1161
#>  6     1 LAX    1159
#>  7     1 CLT    1058
#>  8     1 MIA     981
#>  9     1 SFO     889
#> 10     1 DCA     865
#> # ... with 1,103 more rows

Step 2
Now I cut the data down to keep only the top 5 most popular destinations per month.

my_flights_top_5_by_month <-
  my_flights_agg %>%
  group_by(month) %>%
  slice_max(order_by = n_obs, n = 5)

my_flights_top_5_by_month
#> # A tibble: 60 x 3
#> # Groups:   month [12]
#>    month dest  n_obs
#>    <int> <chr> <int>
#>  1     1 ATL    1396
#>  2     1 ORD    1269
#>  3     1 BOS    1245
#>  4     1 MCO    1175
#>  5     1 FLL    1161
#>  6     2 ATL    1267
#>  7     2 ORD    1197
#>  8     2 BOS    1182
#>  9     2 MCO    1110
#> 10     2 FLL    1073
#> # ... with 50 more rows
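
A caveat on this step: slice_max() keeps ties by default, so a month with a tie at the fifth-highest n_obs would contribute more than five rows. If exactly five rows per month are wanted (an assumption about intent), a minimal variant is:

my_flights_agg %>%
  group_by(month) %>%
  slice_max(order_by = n_obs, n = 5, with_ties = FALSE)  # drop ties at the cutoff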

Step 3
Now simply take the unique() of my_flights_top_5_by_month$dest:

my_flights_top_dest_across_months <- unique(my_flights_top_5_by_month$dest)

## [1] "ATL" "ORD" "BOS" "MCO" "FLL" "LAX" "SFO" "CLT"

Here's my question: given my_flights_top_dest_across_months, how can I summarize my_flights_raw to distinct carrier & month combinations, such that the collapsing principle is whether each combination of carrier & month had flown to each of the dest values in my_flights_top_dest_across_months?

Desired output

##    carrier month ATL   ORD   BOS   MCO   FLL   LAX   SFO   CLT  
##    <chr>   <int> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
##  1 9E          1 TRUE  TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
##  2 9E          2 TRUE  TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
##  3 9E          3 TRUE  TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
##  4 9E          4 FALSE TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
##  5 9E          5 TRUE  TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
##  6 9E          6 FALSE TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
##  7 9E          7 FALSE TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
##  8 9E          8 FALSE TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
##  9 9E          9 FALSE TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
## 10 9E         10 FALSE TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
## # ... with 175 more rows

I currently have the following code, which is simply inefficient. It works fine for the example flights data, but takes forever when applied to a large dataset (several million rows and groups). Any idea how the task described above can be done more efficiently?

# too slow :(
library(tibble) # for add_column()
library(tidyr)  # for unnest_wider()

op_slow_output <- 
  my_flights_raw %>%
  group_by(carrier, month) %>%
  summarise(destinations_vec = list(unique(dest))) %>%
  # attach the same vector of top destinations to every group
  add_column(top_dest = list(my_flights_top_dest_across_months)) %>%
  # flag, per group, which of the top destinations it flew to
  mutate(are_top_dest_included = purrr::map2(.x = destinations_vec, .y = top_dest, .f = ~ .y %in% .x), .keep = "unused") %>%
  mutate(across(are_top_dest_included, ~ purrr::map(.x = ., .f = ~ setNames(object = .x, nm = my_flights_top_dest_across_months)))) %>%
  # spread the named logical vectors into one column per destination
  tidyr::unnest_wider(are_top_dest_included)

It is quite possible that the data.table library would be faster here; I won't argue with that. But I know dplyr well and would like to offer a pretty neat solution using functions from that library.

First, let's prepare two small auxiliary functions; we will see how they work later.

library(nycflights13)
library(tidyverse)


# Return the ntop most frequent destinations in `data` (NA-padded if fewer exist)
ftopDest = function(data, ntop){
  data %>% 
    group_by(dest) %>% 
    summarise(ndest = n()) %>% 
    arrange(desc(ndest)) %>% 
    pull(dest) %>% .[1:ntop]
}

# Flag each flight by whether its destination is among `topDest`
carrierToTopDest = function(data, topDest){
  data %>% mutate(carrierToToDest = dest %in% topDest)
}
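
As a quick sanity check (a sketch; it assumes the flights data and ftopDest defined above), we can run ftopDest on a single month's slice. For January it should return the five destinations shown in the first row of the per-month output further below:

flights %>%
  filter(month == 1) %>%  # January only
  ftopDest(5)
#> expected: "ATL" "ORD" "BOS" "MCO" "FLL"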

Now you only need one simple mutate call!

df = flights %>% nest_by(year, month) %>%  #Step 1
  mutate(topDest = list(ftopDest(data, 5)),  #Step 2
         data = list(carrierToTopDest(data, topDest)))  #Step 3
  

But let me describe step by step what is happening here.

In step 1, we nest the data into an internal tibble named data.

Output after step 1

# A tibble: 12 x 3
# Rowwise:  year, month
    year month                data
   <int> <int> <list<tibble[,17]>>
 1  2013     1       [27,004 x 17]
 2  2013     2       [24,951 x 17]
 3  2013     3       [28,834 x 17]
 4  2013     4       [28,330 x 17]
 5  2013     5       [28,796 x 17]
 6  2013     6       [28,243 x 17]
 7  2013     7       [29,425 x 17]
 8  2013     8       [29,327 x 17]
 9  2013     9       [27,574 x 17]
10  2013    10       [28,889 x 17]
11  2013    11       [27,268 x 17]
12  2013    12       [28,135 x 17]

In step 2, we add the most popular flight destinations.

Output after step 2

# A tibble: 12 x 4
# Rowwise:  year, month
    year month                data topDest  
   <int> <int> <list<tibble[,17]>> <list>   
 1  2013     1       [27,004 x 17] <chr [5]>
 2  2013     2       [24,951 x 17] <chr [5]>
 3  2013     3       [28,834 x 17] <chr [5]>
 4  2013     4       [28,330 x 17] <chr [5]>
 5  2013     5       [28,796 x 17] <chr [5]>
 6  2013     6       [28,243 x 17] <chr [5]>
 7  2013     7       [29,425 x 17] <chr [5]>
 8  2013     8       [29,327 x 17] <chr [5]>
 9  2013     9       [27,574 x 17] <chr [5]>
10  2013    10       [28,889 x 17] <chr [5]>
11  2013    11       [27,268 x 17] <chr [5]>
12  2013    12       [28,135 x 17] <chr [5]>

In the last step, we add the carrierToToDest variable to the data variable; it indicates whether a flight went to one of the ntop most popular destinations for the given month.

Output after step 3

# A tibble: 12 x 4
# Rowwise:  year, month
    year month data                   topDest  
   <int> <int> <list>                 <list>   
 1  2013     1 <tibble [27,004 x 18]> <chr [5]>
 2  2013     2 <tibble [24,951 x 18]> <chr [5]>
 3  2013     3 <tibble [28,834 x 18]> <chr [5]>
 4  2013     4 <tibble [28,330 x 18]> <chr [5]>
 5  2013     5 <tibble [28,796 x 18]> <chr [5]>
 6  2013     6 <tibble [28,243 x 18]> <chr [5]>
 7  2013     7 <tibble [29,425 x 18]> <chr [5]>
 8  2013     8 <tibble [29,327 x 18]> <chr [5]>
 9  2013     9 <tibble [27,574 x 18]> <chr [5]>
10  2013    10 <tibble [28,889 x 18]> <chr [5]>
11  2013    11 <tibble [27,268 x 18]> <chr [5]>
12  2013    12 <tibble [28,135 x 18]> <chr [5]>

Now we can see the most popular destinations. Let's do it:

df %>% mutate(topDest = paste(topDest, collapse = " "))

Output

# A tibble: 12 x 4
# Rowwise:  year, month
    year month data                   topDest            
   <int> <int> <list>                 <chr>              
 1  2013     1 <tibble [27,004 x 18]> ATL ORD BOS MCO FLL
 2  2013     2 <tibble [24,951 x 18]> ATL ORD BOS MCO FLL
 3  2013     3 <tibble [28,834 x 18]> ATL ORD BOS MCO FLL
 4  2013     4 <tibble [28,330 x 18]> ATL ORD LAX BOS MCO
 5  2013     5 <tibble [28,796 x 18]> ORD ATL LAX BOS SFO
 6  2013     6 <tibble [28,243 x 18]> ORD ATL LAX BOS SFO
 7  2013     7 <tibble [29,425 x 18]> ORD ATL LAX BOS CLT
 8  2013     8 <tibble [29,327 x 18]> ORD ATL LAX BOS SFO
 9  2013     9 <tibble [27,574 x 18]> ORD LAX ATL BOS CLT
10  2013    10 <tibble [28,889 x 18]> ORD ATL LAX BOS CLT
11  2013    11 <tibble [27,268 x 18]> ATL ORD LAX BOS CLT
12  2013    12 <tibble [28,135 x 18]> ATL LAX MCO ORD CLT

Can we see flights to these destinations? Of course, it's not difficult.

df %>% select(-topDest) %>% 
  unnest(data) %>% 
  filter(carrierToToDest) %>% 
  select(year, month, flight, carrier, dest) 

Output

# A tibble: 80,941 x 5
# Groups:   year, month [12]
    year month flight carrier dest 
   <int> <int>  <int> <chr>   <chr>
 1  2013     1    461 DL      ATL  
 2  2013     1   1696 UA      ORD  
 3  2013     1    507 B6      FLL  
 4  2013     1     79 B6      MCO  
 5  2013     1    301 AA      ORD  
 6  2013     1   1806 B6      BOS  
 7  2013     1    371 B6      FLL  
 8  2013     1   4650 MQ      ATL  
 9  2013     1   1743 DL      ATL  
10  2013     1   3768 MQ      ORD  
# ... with 80,931 more rows

This is my recipe: very simple and transparent, in my opinion. I would be much obliged if you would try it on your data and let me know how efficient it is.

Small update

I just noticed that the grouping should be not only by year (although you don't mention it, it must be there) and month, but also by the carrier variable. So let's add it as another grouping variable.

df = flights %>% nest_by(year, month, carrier) %>%  
  mutate(topDest = list(ftopDest(data, 5)),  
         data = list(carrierToTopDest(data, topDest)))  

Output

# A tibble: 185 x 5
# Rowwise:  year, month, carrier
    year month carrier data                  topDest  
   <int> <int> <chr>   <list>                <list>   
 1  2013     1 9E      <tibble [1,573 x 17]> <chr [5]>
 2  2013     1 AA      <tibble [2,794 x 17]> <chr [5]>
 3  2013     1 AS      <tibble [62 x 17]>    <chr [5]>
 4  2013     1 B6      <tibble [4,427 x 17]> <chr [5]>
 5  2013     1 DL      <tibble [3,690 x 17]> <chr [5]>
 6  2013     1 EV      <tibble [4,171 x 17]> <chr [5]>
 7  2013     1 F9      <tibble [59 x 17]>    <chr [5]>
 8  2013     1 FL      <tibble [328 x 17]>   <chr [5]>
 9  2013     1 HA      <tibble [31 x 17]>    <chr [5]>
10  2013     1 MQ      <tibble [2,271 x 17]> <chr [5]>
# ... with 175 more rows

Now let's get acquainted with the new top-5 destinations.

df %>% mutate(topDest = paste(topDest, collapse = " "))

Output

# A tibble: 185 x 5
# Rowwise:  year, month, carrier
    year month carrier data                  topDest            
   <int> <int> <chr>   <list>                <chr>              
 1  2013     1 9E      <tibble [1,573 x 17]> BOS PHL CVG MSP ORD
 2  2013     1 AA      <tibble [2,794 x 17]> DFW MIA ORD LAX BOS
 3  2013     1 AS      <tibble [62 x 17]>    SEA NA NA NA NA    
 4  2013     1 B6      <tibble [4,427 x 17]> FLL MCO BOS PBI SJU
 5  2013     1 DL      <tibble [3,690 x 17]> ATL DTW MCO FLL MIA
 6  2013     1 EV      <tibble [4,171 x 17]> IAD DTW DCA RDU CVG
 7  2013     1 F9      <tibble [59 x 17]>    DEN NA NA NA NA    
 8  2013     1 FL      <tibble [328 x 17]>   ATL CAK MKE NA NA  
 9  2013     1 HA      <tibble [31 x 17]>    HNL NA NA NA NA    
10  2013     1 MQ      <tibble [2,271 x 17]> RDU CMH ORD BNA ATL
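
Note the NA padding for carriers that fly to fewer than five distinct destinations (AS, F9, HA above): ftopDest indexes .[1:ntop] unconditionally, so shorter vectors get padded with NA. A hedged variant (ftopDest2 is a name introduced here, not part of the answer above) that truncates instead:

# Like ftopDest, but returns at most ntop destinations, with no NA padding
ftopDest2 = function(data, ntop){
  data %>%
    count(dest, name = "ndest", sort = TRUE) %>%  # frequency table, most frequent first
    pull(dest) %>%
    head(ntop)                                    # truncates safely when fewer than ntop exist
}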

Given my_flights_top_5_by_month and my_flights_raw, we can try the following data.table approach:

library(data.table)

dcast(
  # right-join the monthly top-5 table onto the raw flights:
  # n_obs is non-NA only where (month, dest) was in that month's top 5
  setDT(my_flights_top_5_by_month)[
    setDT(my_flights_raw),
    on = .(month, dest)
  ][, n_obs := !is.na(n_obs)],  # turn the match into a logical flag
  # one row per carrier-month, one column per destination
  carrier + month ~ dest,
  fun.aggregate = function(x) sum(x) > 0,  # TRUE if any flagged flight exists
  value.var = "n_obs"
)

which gives

     carrier month   ABQ   ACK   ALB   ANC   ATL   AUS   AVL   BDL   BGR   BHM
  1:      9E     1 FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
  2:      9E     2 FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
  3:      9E     3 FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
  4:      9E     4 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  5:      9E     5 FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
 ---
181:      YV     8 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
182:      YV     9 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
183:      YV    10 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
184:      YV    11 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
185:      YV    12 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
       BNA   BOS   BQN   BTV   BUF   BUR   BWI   BZN   CAE   CAK   CHO   CHS
  1: FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  2: FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  3: FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  4: FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  5: FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 ---
181: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
182: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
183: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
184: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
185: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
       CLE   CLT   CMH   CRW   CVG   DAY   DCA   DEN   DFW   DSM   DTW   EGE
  1: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  2: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  3: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  4: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  5: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 ---
181: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
182: FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
183: FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
184: FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
185: FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
       EYW   FLL   GRR   GSO   GSP   HDN   HNL   HOU   IAD   IAH   ILM   IND
  1: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  2: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  3: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  4: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  5: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 ---
181: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
182: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
183: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
184: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
185: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
       JAC   JAX   LAS   LAX   LEX   LGA   LGB   MCI   MCO   MDW   MEM   MHT
  1: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  2: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  3: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  4: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  5: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 ---
181: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
182: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
183: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
184: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
185: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
       MIA   MKE   MSN   MSP   MSY   MTJ   MVY   MYR   OAK   OKC   OMA   ORD
  1: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
  2: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
  3: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
  4: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
  5: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
 ---
181: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
182: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
183: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
184: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
185: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
       ORF   PBI   PDX   PHL   PHX   PIT   PSE   PSP   PVD   PWM   RDU   RIC
  1: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  2: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  3: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  4: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  5: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 ---
181: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
182: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
183: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
184: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
185: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
       ROC   RSW   SAN   SAT   SAV   SBN   SDF   SEA   SFO   SJC   SJU   SLC
  1: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  2: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  3: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  4: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  5: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 ---
181: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
182: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
183: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
184: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
185: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
       SMF   SNA   SRQ   STL   STT   SYR   TPA   TUL   TVC   TYS   XNA
  1: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  2: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  3: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  4: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  5: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 ---
181: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
182: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
183: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
184: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
185: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
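
Note that dcast() produces a column for every destination in the data, not only the question's top list. To keep just those columns (a sketch; it assumes the dcast() result above was stored in a variable res, and my_flights_top_dest_across_months comes from step 3 of the question):

res[, c("carrier", "month", my_flights_top_dest_across_months), with = FALSE]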

I took a stab at it myself, using functions from the collapse package.

library(magrittr)
library(collapse)
library(data.table)
  
my_flights_raw %>%
  collapse::funique() %>%                            # distinct (carrier, month, dest) rows
  collapse::fgroup_by(carrier, month) %>%
  collapse::fsummarise(nested_dest = list(dest)) %>% # one list of destinations per group
  collapse::ftransform(new_col = lapply(nested_dest, \(x) my_flights_top_dest_across_months %in% x)) %>%
  collapse::fcompute(., data.table::transpose(new_col), keep = 1:2) %>% # one column per top destination
  setNames(c("carrier", "month", my_flights_top_dest_across_months)) %>%
  collapse::qTBL()                                   # back to a tibble
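
The data.table::transpose() step is what turns the per-group logical vectors into one list element per destination, ready to become columns. A minimal illustration with made-up values, two groups by three destinations:

data.table::transpose(list(c(TRUE, FALSE, TRUE), c(FALSE, TRUE, TRUE)))
#> [[1]]
#> [1]  TRUE FALSE
#> 
#> [[2]]
#> [1] FALSE  TRUE
#> 
#> [[3]]
#> [1] TRUE TRUE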

Unsurprisingly, collapse gives the fastest execution time. But I was surprised to see that @ThomasIsCoding's data.table-based solution was slower than my original tidyverse mix-and-match solution.

I also factored in the single data.table dependency in Thomas's answer, compared to the variety of dependencies in my original method.

library(nycflights13)
library(dplyr, warn.conflicts = FALSE)

# OP original
my_flights_raw <-
  flights %>%
  select(carrier, month, dest)

my_flights_agg <- 
  my_flights_raw %>%
  count(month, dest, name = "n_obs") %>%
  arrange(month, desc(n_obs)) 

my_flights_top_dest_across_months <-
  my_flights_agg %>%
  group_by(month) %>%
  slice_max(order_by = n_obs, n = 5) %>%
  pull(dest) %>%
  unique()

my_flights_top_5_by_month <-
  my_flights_agg %>%
  group_by(month) %>%
  slice_max(order_by = n_obs, n = 5)

my_flights_top_dest_across_month <- unique(my_flights_top_5_by_month$dest)


op_slow <- function() {
  library(tidyr)
  library(tibble)
  library(purrr)
  
  my_flights_raw %>%
    group_by(carrier, month) %>%
    summarise(destinations_vec = list(unique(dest))) %>%
    add_column(top_dest = list(my_flights_top_dest_across_month)) %>%
    mutate(are_top_dest_included = purrr::map2(.x = destinations_vec, .y = top_dest, .f = ~ .y %in% .x ), .keep = "unused") %>%
    mutate(across(are_top_dest_included, ~purrr::map(.x = ., .f = ~setNames(object = .x, nm = my_flights_top_dest_across_month))  )) %>%
    tidyr::unnest_wider(are_top_dest_included)
}  


# OP collapse
op_collapse <- function() {
  library(magrittr)
  library(collapse)
  library(data.table)
  
  my_flights_raw %>%
    collapse::funique() %>%
    collapse::fgroup_by(carrier, month) %>%
    collapse::fsummarise(nested_dest = list(dest)) %>%
    collapse::ftransform(new_col = lapply(nested_dest, \(x) my_flights_top_dest_across_months %in% x)) %>%
    collapse::fcompute(., data.table::transpose(new_col), keep = 1:2) %>%
    setNames(c("carrier", "month", my_flights_top_dest_across_months)) %>%
    collapse::qTBL()
}
  

# Thomas data.table
thomas_data.table <- function() {
  library(data.table)
  
  my_flights_top_dest_across_months <- 
    data.table(
      dest = unique(my_flights_top_5_by_month$dest),
      fd = 1
    )
  
  dcast(my_flights_top_dest_across_months[
    setDT(my_flights_raw),
    on = .(dest)
  ],
  carrier + month ~ dest,
  fun.aggregate = function(x) sum(x) > 0
  )[, c(
    "carrier", "month",
    my_flights_top_dest_across_months$dest
  ), with = FALSE]
}

output_op_slow <- op_slow()
output_op_collapse <- op_collapse()
output_thomas <- thomas_data.table()
#> Using 'month' as value column. Use 'value.var' to override

waldo::compare(output_op_slow, output_op_collapse, ignore_attr = TRUE)
#> v No differences
waldo::compare(output_op_slow, as_tibble(output_thomas), ignore_attr = TRUE) 
#> v No differences

bm <- bench::mark(op_slow = op_slow(),
            op_collapse = op_collapse(),
            thomas_dt = thomas_data.table(),
            check = FALSE,
            iterations = 100)

ggplot2::autoplot(bm)
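
Besides the plot, the raw timings can be inspected directly (a sketch; bm is the bench::mark object created above):

bm[, c("expression", "median", "mem_alloc")]  # textual view of the benchmark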

Does this do what you want? As far as I can tell it matches your output but has more rows because it includes all months for all carriers; carrier "OO" only has flights in 5 months and your version only shows those 5 months in the summary.

With the data as provided (336k rows), this takes a similar amount of time to your function, but it's faster as you deal with larger data. When I run these on data 100x as big, after setting my_flights_raw <- my_flights_raw %>% tidyr::uncount(100) to make it 33M rows, the code below is about 40% faster.

Given the large number of groups you're dealing with, I expect this is a situation where data.table will really shine with better performance.

library(tidyverse)
my_flights_raw %>%
  count(carrier, month, dest) %>%                        # flights per carrier-month-destination
  complete(carrier, month, dest) %>%                     # add NA rows for unobserved combinations
  filter(dest %in% my_flights_top_dest_across_months) %>%
  mutate(n = !is.na(n)) %>%                              # TRUE wherever at least one flight exists
  pivot_wider(names_from = dest, values_from = n)
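
If the extra rows are unwanted, i.e. the row count should match the question's 185 carrier-month pairs, one option (a sketch building on the pipeline above; observed_pairs is a helper name introduced here) is to keep only the pairs that actually occur in the data:

# Restrict to carrier-month pairs observed in the raw data
observed_pairs <- my_flights_raw %>% distinct(carrier, month)

my_flights_raw %>%
  count(carrier, month, dest) %>%
  complete(carrier, month, dest) %>%
  semi_join(observed_pairs, by = c("carrier", "month")) %>%  # drop never-flying pairs
  filter(dest %in% my_flights_top_dest_across_months) %>%
  mutate(n = !is.na(n)) %>%
  pivot_wider(names_from = dest, values_from = n)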
