
R Arrow returns wrong column when chaining multiple group_by / summarise

I have a query with multiple group_by() / summarise() steps. When I ungroup the data between them, everything works fine, but if I don't, one of the columns is replaced by another.

I would expect the columns to remain unchanged. For example, in the examples below the variable gender should be F or M, not Group X.

library(dplyr)
library(arrow)

# Create sample dataset
N <- 1000
set.seed(123)
orig_data <- tibble(
  code_group = sample(paste("Group", 1:2), N, replace = TRUE),
  year = sample(2015:2016, N, replace = TRUE),
  gender = sample(c("F", "M"), N, replace = TRUE),
  value = runif(N, 0, 10)
)
write_dataset(orig_data, "example")

# Query and replicate the error
(ds <- open_dataset("example/"))
#> FileSystemDataset with 1 Parquet file
#> code_group: string
#> year: int32
#> gender: string
#> value: double

ds |>
  group_by(year, code_group, gender) |>
  summarise(value = sum(value)) |>
  group_by(code_group, gender) |>
  summarise(value = max(value), NN = n()) |>
  collect()
#> # A tibble: 2 × 4
#> # Groups:   code_group [2]
#>   code_group gender  value    NN
#>   <chr>      <chr>   <dbl> <int>
#> 1 Group 1    Group 1  724.     4
#> 2 Group 2    Group 2  661.     4

Problem: the values of the gender column have been replaced by the values of the code_group column.

ds |>
  group_by(year, code_group, gender) |>
  summarise(value = sum(value)) |>
  ungroup() |>                                             #< Added this line...
  group_by(code_group, gender) |>
  summarise(value = max(value), NN = n()) |>
  collect()
#> # A tibble: 4 × 4
#> # Groups:   code_group [2]
#>   code_group gender value    NN
#>   <chr>      <chr>  <dbl> <int>
#> 1 Group 1    F       724.     2
#> 2 Group 2    M       627.     2
#> 3 Group 1    M       658.     2
#> 4 Group 2    F       661.     2

Note that after inserting ungroup() between the two group_by() / summarise() steps, gender is no longer replaced.

A quick look at the query plan (note Node 4, where "gender": code_group):

ds |>
  group_by(year, code_group, gender) |>
  summarise(value = sum(value)) |>
  group_by(code_group, gender) |>
  summarise(value = max(value), NN = n()) |> 
  show_query()
#> ExecPlan with 8 nodes:
#> 7:SinkNode{}
#>   6:ProjectNode{projection=[code_group, gender, value, NN]}
#>     5:GroupByNode{keys=["code_group", "gender"], aggregates=[
#>      hash_max(value, {skip_nulls=false, min_count=0}),
#>      hash_sum(NN, {skip_nulls=true, min_count=1}),
#>     ]}
#>       4:ProjectNode{projection=[value, "NN": 1, code_group, "gender": code_group]}
#>         3:ProjectNode{projection=[year, code_group, gender, value]}
#>           2:GroupByNode{keys=["year", "code_group", "gender"], aggregates=[
#>              hash_sum(value, {skip_nulls=false, min_count=0}),
#>           ]}
#>             1:ProjectNode{projection=[value, year, code_group, gender]}
#>               0:SourceNode{}

Created on 2022-12-07 by the reprex package (v2.0.1)

Am I misunderstanding arrow/dplyr, or is this a bug (and if so, is it in arrow or in dplyr/dbplyr)?

This was indeed a bug, and it has been closed with PR 14905. It should work with the development version of arrow on GitHub.
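To check whether an installed version of arrow is still affected, one option is to run the same two-step aggregation both through Arrow and on the in-memory tibble with plain dplyr, then compare the gender column. The following is only a sketch under assumptions: the helper name two_step and the directory name example_check are made up for illustration and are not part of the original post.

```r
library(dplyr)
library(arrow)

# Rebuild the sample data from the question in a temporary directory
N <- 1000
set.seed(123)
orig_data <- tibble(
  code_group = sample(paste("Group", 1:2), N, replace = TRUE),
  year = sample(2015:2016, N, replace = TRUE),
  gender = sample(c("F", "M"), N, replace = TRUE),
  value = runif(N, 0, 10)
)
path <- file.path(tempdir(), "example_check")  # hypothetical directory name
write_dataset(orig_data, path)

# Hypothetical helper: the two-step aggregation without ungroup() in between
two_step <- function(data) {
  data |>
    group_by(year, code_group, gender) |>
    summarise(value = sum(value)) |>
    group_by(code_group, gender) |>
    summarise(value = max(value), NN = n())
}

# Same pipeline, once through Arrow and once through in-memory dplyr
from_arrow <- open_dataset(path) |>
  two_step() |>
  collect() |>
  ungroup() |>
  arrange(code_group, gender)

from_dplyr <- orig_data |>
  two_step() |>
  ungroup() |>
  arrange(code_group, gender)

# On an affected arrow version, gender contains "Group 1" / "Group 2"
# and this evaluates to FALSE
all(from_arrow$gender %in% c("F", "M"))
```

If the last expression is FALSE, the installed arrow version still shows the problem, and the ungroup() workaround from the question (or an upgrade) applies.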

