I am trying to implement an SQL query using both sqldf and dplyr.
I need to do this separately using these 2 different libraries.
Unfortunately, I cannot produce the same result using dplyr.
library(sqldf)
library(dplyr)
Id <- c(1,2,3,4)
HasPet <- c(0,0,1,1)
Age <- c(20,1,14,10)
Posts <- data.frame(Id, HasPet, Age)
# sqldf way
ref <- sqldf("
SELECT Id, HasPet, MAX(Age) AS MaxAge
FROM Posts
GROUP BY HasPet
")
# dplyr way
res <- Posts %>%
  group_by(HasPet) %>%
  summarize(
    Id,
    HasPet,
    MaxAge = max(Age)
  ) %>%
  select(Id, HasPet, MaxAge)
head(ref)
head(res)
Output for sqldf is:
> head(ref)
  Id HasPet MaxAge
1  1      0     20
2  3      1     14
while the output for dplyr is different:
> head(res)
# A tibble: 4 x 3
# Groups:   HasPet [2]
     Id HasPet MaxAge
  <dbl>  <dbl>  <dbl>
1     1      0     20
2     2      0     20
3     3      1     14
4     4      1     14
UPD: The SQL query cannot be modified.
The code is not wrong, but the logic you are trying to achieve is. Let me explain:
Your expected output for the grouping contains Id=1,3. But how would R know to pick those and not Id=2,4? More specifically, when you group by HasPet=0, which value of Id should R choose, 1 or 2? How would R know if you didn't give it a specific criterion to use? That said, this gives your expected output:
res <- Posts %>%
  group_by(HasPet) %>%
  summarize(Id = min(Id),
            MaxAge = max(Age))
The answer to your question is that the SQL query is not doing the same thing as your R code version. Here is the equivalent SQL query:
SELECT Id, HasPet, MAX(Age) OVER (PARTITION BY HasPet) AS MaxAge
FROM Posts
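To make the comparison concrete, here is a minimal dplyr sketch of what that window-function query computes: a grouped mutate() keeps every row and repeats the per-group maximum. The data frame is rebuilt from the question so the snippet runs on its own; the name windowed is only illustrative.

```r
library(dplyr)

# Same data as in the question
Posts <- data.frame(
  Id     = c(1, 2, 3, 4),
  HasPet = c(0, 0, 1, 1),
  Age    = c(20, 1, 14, 10)
)

# mutate() on a grouped data frame keeps all rows and repeats the
# per-group maximum, like MAX(Age) OVER (PARTITION BY HasPet)
windowed <- Posts %>%
  group_by(HasPet) %>%
  mutate(MaxAge = max(Age)) %>%
  ungroup() %>%
  select(Id, HasPet, MaxAge)

print(windowed)
```

This returns four rows, one per input row, which matches the shape of the dplyr output shown in the question.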
Actually, your current query is technically invalid in standard SQL, because it aggregates by HasPet but selects Id. It isn't clear which value of Id you want to select. Here is a valid version of your original query:
SELECT HasPet, MAX(Age) AS MaxAge
FROM Posts
GROUP BY HasPet
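One caveat, since the SQL query cannot be changed: sqldf runs on SQLite by default, and SQLite accepts the original query. Its documentation states that when a query contains a single MAX() or MIN() aggregate, bare columns such as Id are taken from the row that holds that maximum or minimum. Assuming that behavior, a closer dplyr counterpart keeps the row with the highest Age in each group. This is only a sketch: the name max_rows is illustrative, and with_ties = FALSE is my own choice to guarantee one row per group.

```r
library(dplyr)

# Same data as in the question
Posts <- data.frame(
  Id     = c(1, 2, 3, 4),
  HasPet = c(0, 0, 1, 1),
  Age    = c(20, 1, 14, 10)
)

# Keep, for each HasPet group, the single row with the largest Age,
# then expose that Age under the name MaxAge
max_rows <- Posts %>%
  group_by(HasPet) %>%
  slice_max(Age, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(Id, HasPet, MaxAge = Age)

print(max_rows)
```

On this data it yields Id 1 and 3, the same rows as the sqldf output. Unlike min(Id), it would still follow the SQLite result if the oldest row in a group did not happen to have the smallest Id.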
This problem can be solved by using:
slice(which.min(Id))
after "group_by" and "summarize" function calls.
For example:
# dplyr way
res <- Posts %>%
  group_by(HasPet) %>%
  summarize(
    Id,
    HasPet,
    MaxAge = max(Age)
  ) %>%
  select(Id, HasPet, MaxAge) %>%
  slice(which.min(Id))
In this case, the output is the same as with sqldf:
> res
# A tibble: 2 x 3
# Groups:   HasPet [2]
     Id HasPet MaxAge
  <dbl>  <dbl>  <dbl>
1     1      0     20
2     3      1     14
PS I think there are simpler ways, but so far I have not found them.