
Spark SQL vs Normal SQL query error without using first()

I am trying to run a simple query in Spark SQL, but it throws an error unless I use first().

This query works fine in MySQL:

SELECT film.title,
       count(rental.rental_id) as total_rentals,
       film.rental_rate,
       count(rental.rental_id) * film.rental_rate as revenue
FROM rental
         INNER JOIN inventory ON rental.inventory_id = inventory.inventory_id
         INNER JOIN film ON inventory.film_id = film.film_id
GROUP BY film.title
ORDER BY 1

But the same query doesn't work in Spark SQL. The error I am getting is:

Exception in thread "main" org.apache.spark.sql.AnalysisException: expression 'film.`rental_rate`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;

Doing this actually fixes the problem:

SELECT film.title,
       count(rental.rental_id) as total_rentals,
       first(film.rental_rate),
       count(rental.rental_id) * first(film.rental_rate) as revenue
FROM rental
INNER JOIN inventory ON rental.inventory_id = inventory.inventory_id
INNER JOIN film ON inventory.film_id = film.film_id
GROUP BY film.title
ORDER BY 1

Can someone explain why this is required in Spark SQL?

Standard SQL requires that all non-aggregated columns in the SELECT list of a GROUP BY query appear in the GROUP BY clause. Some databases understand the concept of functionally dependent columns and let you get away with putting only the primary key column in the GROUP BY clause.

I guess that title is not the primary key of film, so your original query is not valid standard SQL. I suspect that you are running it in MySQL, which (alas!) has options that allow disabling the standard requirement.
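For reference, the switch in question is MySQL's ONLY_FULL_GROUP_BY flag in sql_mode, which has been enabled by default since MySQL 5.7.5. A minimal sketch of enforcing the standard rule for a session:

-- With this flag on, MySQL rejects the original query much as Spark does
SET SESSION sql_mode = CONCAT(@@sql_mode, ',ONLY_FULL_GROUP_BY');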

In a database that supports functional dependency in GROUP BY, you would phrase the query as:

SELECT f.title,
       count(*) as total_rentals,
       f.rental_rate,
       count(*) * f.rental_rate as revenue
FROM rental r
INNER JOIN inventory i ON r.inventory_id = i.inventory_id
INNER JOIN film f ON i.film_id = f.film_id
GROUP BY f.film_id
ORDER BY 1

I don't think Spark understands functional dependency, so just add all the needed columns to the GROUP BY clause:

SELECT f.title,
       count(*) as total_rentals,
       f.rental_rate,
       count(*) * f.rental_rate as revenue
FROM rental r
INNER JOIN inventory i ON r.inventory_id = i.inventory_id
INNER JOIN film f ON i.film_id = f.film_id
GROUP BY f.film_id, f.title, f.rental_rate
ORDER BY 1

Notes:

  • having film_id in the GROUP BY clause is still good practice; in real life, two different movies might have the same title and rate, and you don't want to group them together

  • count(r.rental_id) can be simplified to count(*) (since that column obviously cannot be null)

  • table aliases make the queries easier to write and read

I suspect that you want:

SELECT f.title, COUNT(*) as total_rentals, f.rental_rate,  
       SUM(f.rental_rate) as revenue
FROM rental r JOIN
     inventory i
     ON r.inventory_id = i.inventory_id JOIN
     film f
     ON i.film_id = f.film_id
GROUP BY f.title, f.rental_rate
ORDER BY 1;

Notes:

  • In general, the GROUP BY columns should be the unaggregated columns in the SELECT. That has even been required (with the default settings) in MySQL for the past few years.
  • You can just sum the rental_rate column; there is no need to count and multiply (see the sketch after this list).
  • Table aliases make the query easier to write and to read.
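To see why the sum works, note that within one group every row carries the same rental_rate, so adding it up over N rows equals multiplying it by N. A minimal sketch with a hypothetical inline table (Spark SQL supports this VALUES syntax):

SELECT title,
       COUNT(*)               AS total_rentals,
       COUNT(*) * rental_rate AS via_multiply, -- 3 * 2.99 = 8.97
       SUM(rental_rate)       AS via_sum       -- 2.99 + 2.99 + 2.99 = 8.97
FROM VALUES ('ACE GOLDFINGER', 2.99),
            ('ACE GOLDFINGER', 2.99),
            ('ACE GOLDFINGER', 2.99) AS t(title, rental_rate)
GROUP BY title, rental_rate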

The first query works in MySQL because MySQL extends the SQL syntax to allow it. Spark SQL (in this case) is doing what just about every other database does.
