
How can I get the same grouping result with dplyr as with sqldf?

I am trying to implement a SQL query using both sqldf and dplyr.
I need to do this separately with each of these two libraries.
Unfortunately, I cannot reproduce the sqldf result with dplyr.

library(sqldf)
library(dplyr)

Id       <- c(1,2,3,4)
HasPet   <- c(0,0,1,1)
Age      <- c(20,1,14,10)

Posts <- data.frame(Id, HasPet, Age)

# sqldf way
ref <- sqldf("
      SELECT Id, HasPet, MAX(Age) AS MaxAge
      FROM Posts
      GROUP BY HasPet
  ")

# dplyr way
res <- Posts %>%
  group_by(HasPet) %>%
  summarize(
    Id,
    HasPet,
    MaxAge = max(Age)
    ) %>%
  select(Id, HasPet, MaxAge)

head(ref)
head(res)

Output for sqldf is:

> head(ref)
  Id HasPet MaxAge
1  1      0     20
2  3      1     14

while the output for dplyr is different:

> head(res)
# A tibble: 4 x 3
# Groups:   HasPet [2]
     Id HasPet MaxAge
  <dbl>  <dbl>  <dbl>
1     1      0     20
2     2      0     20
3     3      1     14
4     4      1     14

Update: the SQL query cannot be modified.

The code is not wrong, but the logic you are trying to implement is. Let me explain:

Your expected output contains Id=1,3 . But how would R know to return those rather than Id=2,4 ? More specifically, when you group by HasPet=0 , which value of Id should R choose, 1 or 2 ? R cannot know unless you give it a specific criterion. That said, this gives your expected output:

res <- Posts %>%
  group_by(HasPet) %>%
  summarize(Id = min(Id),
            MaxAge = max(Age))
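Note that min(Id) matches the sqldf result here only because, in this data, the row with the largest Age in each group also has the smallest Id. Assuming sqldf is running on its default SQLite backend, SQLite documents that in a query with a single MAX() aggregate, bare columns such as Id are taken from the row that holds the maximum. A minimal sketch that mirrors this rule in dplyr (it assumes dplyr 1.0.0 or later for slice_max; the name res_max is only illustrative) could look like this:

```r
library(dplyr)

# Same sample data as in the question
Posts <- data.frame(
  Id     = c(1, 2, 3, 4),
  HasPet = c(0, 0, 1, 1),
  Age    = c(20, 1, 14, 10)
)

# For each HasPet group keep the single row with the largest Age,
# so Id comes from that same row (assumed SQLite bare-column rule).
res_max <- Posts %>%
  group_by(HasPet) %>%
  slice_max(Age, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(Id, HasPet, MaxAge = Age)

print(res_max)
```

If several rows in a group share the maximum Age, with_ties = FALSE keeps just one of them; which one SQLite itself would pick in that case is not something this sketch tries to reproduce.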

The answer to your question is that the SQL query is not doing the same thing as your R code version. Here is the equivalent SQL query:

SELECT Id, HasPet, MAX(Age) OVER (PARTITION BY HasPet) AS MaxAge
FROM Posts
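To make the correspondence concrete: in dplyr, a window function over a partition is usually written as group_by() followed by mutate(), which keeps every input row instead of collapsing each group. The following is a sketch of that idea on the question's data (the name res_window is only illustrative), and it yields the same four rows as the dplyr output shown in the question:

```r
library(dplyr)

# Same sample data as in the question
Posts <- data.frame(
  Id     = c(1, 2, 3, 4),
  HasPet = c(0, 0, 1, 1),
  Age    = c(20, 1, 14, 10)
)

# mutate() on a grouped data frame computes max(Age) per group
# and repeats it on every row of that group, like MAX() OVER (PARTITION BY ...).
res_window <- Posts %>%
  group_by(HasPet) %>%
  mutate(MaxAge = max(Age)) %>%
  ungroup() %>%
  select(Id, HasPet, MaxAge)

print(res_window)
```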

Actually, your current query is technically invalid in standard SQL, because it groups by HasPet but selects Id , which is neither grouped nor aggregated. It isn't clear which value of Id you want to select. Here is a valid version of your original query:

SELECT HasPet, MAX(Age) AS MaxAge
FROM Posts
GROUP BY HasPet

This problem can be solved by using:

slice(which.min(Id))

after the group_by and summarize calls.

For example:

# dplyr way
res <- Posts %>%
  group_by(HasPet) %>%
  summarize(
    Id,
    HasPet,
    MaxAge = max(Age)
    ) %>%
  select(Id, HasPet, MaxAge) %>%
  slice(which.min(Id))

In this case, the output is the same as with sqldf:

> res
# A tibble: 2 x 3
# Groups:   HasPet [2]
     Id HasPet MaxAge
  <dbl>  <dbl>  <dbl>
1     1      0     20
2     3      1     14

PS I think there are simpler ways, but so far I have not found them.
