I am trying to run a simple query in Spark SQL, but it throws an error unless I use first().
This query works normally with MySQL:
SELECT film.title,count(rental.rental_id) as total_rentals, film.rental_rate, count(rental.rental_id) * film.rental_rate as revenue
FROM rental
INNER JOIN inventory ON rental.inventory_id = inventory.inventory_id
INNER JOIN film ON inventory.film_id = film.film_id
GROUP BY film.title
ORDER BY 1
But the same query doesn't work with Spark SQL. The error I am getting is:
Exception in thread "main" org.apache.spark.sql.AnalysisException: expression 'film.`rental_rate`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;
Doing this actually fixes the problem:
SELECT film.title,count(rental.rental_id) as total_rentals, first(film.rental_rate), count(rental.rental_id) * first(film.rental_rate) as revenue
FROM rental
INNER JOIN inventory ON rental.inventory_id = inventory.inventory_id
INNER JOIN film ON inventory.film_id = film.film_id
GROUP BY film.title
ORDER BY 1
Can someone explain why this is required in Spark SQL?
There is a common requirement in SQL that all non-aggregated columns in a group by query must appear in the group by clause. Some databases understand the concept of functionally dependent columns and let you get away with putting only the primary key column in the group by clause.
I guess that title is not the primary key of film, so your original query is not valid standard SQL. I suspect that you are running this in MySQL, which (alas!) has options that allow disabling the standard requirement.
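To see why a bare rental_rate is ambiguous when grouping by title alone, here is a minimal sketch. It uses Python's sqlite3 module as a stand-in engine (not Spark), a cut-down film table, and made-up rows in which two films share a title; all of that is assumed for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE film (film_id INTEGER PRIMARY KEY, title TEXT, rental_rate REAL);
    -- Made-up rows: two different films share the title 'HEAT' but not the rate.
    INSERT INTO film VALUES (1, 'HEAT', 3.0), (2, 'HEAT', 5.0), (3, 'ALIEN', 1.0);
""")

# Titles for which "the" rental_rate of the group is not well defined.
ambiguous_titles = conn.execute("""
    SELECT title, COUNT(DISTINCT rental_rate) AS distinct_rates
    FROM film
    GROUP BY title
    HAVING COUNT(DISTINCT rental_rate) > 1
""").fetchall()

print(ambiguous_titles)  # [('HEAT', 2)]
```

For a group like 'HEAT' the engine has no single rental_rate to return, which is why Spark asks you either to group by the column or to say explicitly (with first()) that any value will do.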
In a database that supports functional dependency in group by, you would phrase the query as:
SELECT f.title, count(*) as total_rentals, f.rental_rate, count(*) * f.rental_rate as revenue
FROM rental r
INNER JOIN inventory i ON r.inventory_id = i.inventory_id
INNER JOIN film f ON i.film_id = f.film_id
GROUP BY f.film_id
ORDER BY 1
I don't think Spark would understand that, so just add all the needed columns to the group by clause:
SELECT f.title, count(*) as total_rentals, f.rental_rate, count(*) * f.rental_rate as revenue
FROM rental r
INNER JOIN inventory i ON r.inventory_id = i.inventory_id
INNER JOIN film f ON i.film_id = f.film_id
GROUP BY f.film_id, f.title, f.rental_rate
ORDER BY 1
Notes:
- having film_id in the group by clause is still good practice; in real life, two different movies might have the same title and rate, and you don't want to group them together
- count(r.rental_id) can be simplified as count(*) (since obviously that column cannot be null)
- table aliases make the queries easier to write and read
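The first note can be checked with a small sketch. It again assumes Python's sqlite3 module as a stand-in engine and made-up rows (two films with the same title and rate); the table layout is a cut-down guess at the schema in the question, not the real one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE film (film_id INTEGER PRIMARY KEY, title TEXT, rental_rate REAL);
    CREATE TABLE inventory (inventory_id INTEGER PRIMARY KEY, film_id INTEGER);
    CREATE TABLE rental (rental_id INTEGER PRIMARY KEY, inventory_id INTEGER);
    -- Made-up rows: films 1 and 2 share a title and a rate.
    INSERT INTO film VALUES (1, 'HEAT', 3.0), (2, 'HEAT', 3.0), (3, 'ALIEN', 1.0);
    INSERT INTO inventory VALUES (10, 1), (20, 2), (30, 3);
    INSERT INTO rental VALUES (100, 10), (101, 10), (102, 20), (103, 30);
""")

JOINS = """
    FROM rental r
    INNER JOIN inventory i ON r.inventory_id = i.inventory_id
    INNER JOIN film f ON i.film_id = f.film_id
"""

# Without film_id, the two 'HEAT' films collapse into a single group.
by_title_and_rate = conn.execute(
    "SELECT f.title, count(*), f.rental_rate" + JOINS +
    "GROUP BY f.title, f.rental_rate ORDER BY 1"
).fetchall()

# With film_id, each film keeps its own row.
by_film_id = conn.execute(
    "SELECT f.title, count(*), f.rental_rate" + JOINS +
    "GROUP BY f.film_id, f.title, f.rental_rate ORDER BY 1, f.film_id"
).fetchall()

print(by_title_and_rate)  # [('ALIEN', 1, 1.0), ('HEAT', 3, 3.0)]
print(by_film_id)         # [('ALIEN', 1, 1.0), ('HEAT', 2, 3.0), ('HEAT', 1, 3.0)]
```

Both queries list every selected non-aggregated column in the group by clause, so they should be accepted by engines that enforce the standard rule as well.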
I suspect that you want:
SELECT f.title, COUNT(*) as total_rentals, f.rental_rate,
SUM(f.rental_rate) as revenue
FROM rental r JOIN
inventory i
ON r.inventory_id = i.inventory_id JOIN
film f
ON i.film_id = f.film_id
GROUP BY f.title, f.rental_rate
ORDER BY 1;
Notes:
- The GROUP BY columns should be the unaggregated columns in the SELECT. That has even been required (with the default settings) in MySQL for the past few years.
- You can simply sum the rental_rate column. There is no need to count and multiply.
That the first query works in MySQL is because MySQL extends the SQL syntax to allow it. Spark SQL (in this case) is doing what just about every other database does.
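As a quick sanity check that summing and count-and-multiply agree, here is a small sketch. It assumes Python's sqlite3 module as a stand-in engine (not Spark) and made-up rows; the rates 0.5 and 2.5 are chosen only because they are exact in floating point, so the two revenue columns can be compared directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE film (film_id INTEGER PRIMARY KEY, title TEXT, rental_rate REAL);
    CREATE TABLE inventory (inventory_id INTEGER PRIMARY KEY, film_id INTEGER);
    CREATE TABLE rental (rental_id INTEGER PRIMARY KEY, inventory_id INTEGER);
    -- Made-up rows for illustration only.
    INSERT INTO film VALUES (1, 'ALIEN', 0.5), (2, 'HEAT', 2.5);
    INSERT INTO inventory VALUES (10, 1), (20, 2), (21, 2);
    INSERT INTO rental VALUES (100, 10), (101, 20), (102, 20), (103, 21);
""")

# Every non-aggregated SELECT column appears in GROUP BY, so the query
# follows the standard rule that Spark SQL enforces.
rows = conn.execute("""
    SELECT f.title,
           COUNT(*) AS total_rentals,
           f.rental_rate,
           SUM(f.rental_rate) AS revenue_by_sum,
           COUNT(*) * f.rental_rate AS revenue_by_multiply
    FROM rental r
    JOIN inventory i ON r.inventory_id = i.inventory_id
    JOIN film f ON i.film_id = f.film_id
    GROUP BY f.title, f.rental_rate
    ORDER BY 1
""").fetchall()

for row in rows:
    print(row)
# ('ALIEN', 1, 0.5, 0.5, 0.5)
# ('HEAT', 3, 2.5, 7.5, 7.5)
```

With real prices such as 0.99, the two columns may differ in the last floating-point digits unless the rate is stored as a decimal type.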