
Spark SQL vs Normal SQL query error without using first()

I am trying to run a simple query in Spark SQL, but it throws an error unless I use first().

This query works fine in MySQL:

SELECT film.title,count(rental.rental_id) as total_rentals, film.rental_rate, count(rental.rental_id) * film.rental_rate as revenue
FROM rental
         INNER JOIN inventory ON rental.inventory_id = inventory.inventory_id
         INNER JOIN film ON inventory.film_id = film.film_id
GROUP BY film.title
ORDER BY 1

But the same query doesn't work in Spark SQL. The error I am getting is:

Exception in thread "main" org.apache.spark.sql.AnalysisException: expression 'film.`rental_rate`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;

Doing this actually fixes the problem:

SELECT  film.title,count(rental.rental_id) as total_rentals, first(film.rental_rate), count(rental.rental_id) * first(film.rental_rate) as revenue
FROM rental
INNER JOIN inventory ON rental.inventory_id = inventory.inventory_id
INNER JOIN film ON inventory.film_id = film.film_id
GROUP BY film.title
ORDER BY 1

Can someone explain why this is required in Spark SQL?

There is a common requirement in SQL that all non-aggregated columns in a GROUP BY query must appear in the GROUP BY clause. Some databases understand the concept of functionally dependent columns and let you get away with putting only the primary key column in the GROUP BY clause.

I guess that title is not the primary key of film, so your original query is not valid standard SQL. I suspect that you are running this in MySQL, which (alas!) has options that allow disabling the standard requirement.
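The ambiguity behind this rule can be sketched in plain Python. The rows below are made up for illustration (not real Sakila data); they show what the engine sees when you group by title alone and a non-grouped column has more than one candidate value per group:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical (title, rental_rate) rows after the joins -- not real Sakila data.
rows = [
    ("ACADEMY DINOSAUR", 0.99),
    ("ACADEMY DINOSAUR", 0.99),
    ("ALIEN CENTER", 2.99),
    ("ALIEN CENTER", 4.99),  # same title, different rate: which one should win?
]

# Group by title only, the way the original query does.
rates_by_title = {}
for title, grp in groupby(sorted(rows, key=itemgetter(0)), key=itemgetter(0)):
    rates_by_title[title] = sorted({rate for _, rate in grp})

for title, rates in rates_by_title.items():
    if len(rates) > 1:
        # This is the ambiguity Spark refuses to resolve silently:
        # first() would just pick one of these values arbitrarily.
        print(f"{title}: ambiguous rental_rate {rates}")
    else:
        print(f"{title}: rental_rate {rates[0]}")
```

When every group has exactly one candidate value, first() is harmless; when it doesn't, first() silently hides a real ambiguity, which is why Spark makes you opt in.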

In a database that supports functional dependency in GROUP BY, you would phrase the query as:

SELECT f.title, count(*) as total_rentals, f.rental_rate, count(*) * f.rental_rate as revenue
FROM rental r
INNER JOIN inventory i ON r.inventory_id = i.inventory_id
INNER JOIN film f ON i.film_id = f.film_id
GROUP BY f.film_id
ORDER BY 1

I don't think Spark understands that, so just add all the needed columns to the GROUP BY clause:

SELECT f.title, count(*) as total_rentals, f.rental_rate, count(*) * f.rental_rate as revenue
FROM rental r
INNER JOIN inventory i ON r.inventory_id = i.inventory_id
INNER JOIN film f ON i.film_id = f.film_id
GROUP BY f.film_id, f.title, f.rental_rate
ORDER BY 1
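The fully-grouped query can be checked end-to-end against a tiny in-memory SQLite stand-in for the Sakila tables (the sample rows below are hypothetical; SQLite's GROUP BY rules are looser than Spark's, but a query grouped on every non-aggregated column is valid in both):

```python
import sqlite3

# Minimal in-memory stand-in for the Sakila schema, with made-up sample rows.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE film (film_id INTEGER PRIMARY KEY, title TEXT, rental_rate REAL);
CREATE TABLE inventory (inventory_id INTEGER PRIMARY KEY, film_id INTEGER);
CREATE TABLE rental (rental_id INTEGER PRIMARY KEY, inventory_id INTEGER);
INSERT INTO film VALUES (1, 'ACADEMY DINOSAUR', 0.99), (2, 'ALIEN CENTER', 2.99);
INSERT INTO inventory VALUES (10, 1), (11, 1), (12, 2);
INSERT INTO rental VALUES (100, 10), (101, 11), (102, 12), (103, 12);
""")

# Same shape as the fixed query: every non-aggregated column is grouped.
grouped_rows = con.execute("""
SELECT f.title, count(*) AS total_rentals, f.rental_rate,
       count(*) * f.rental_rate AS revenue
FROM rental r
JOIN inventory i ON r.inventory_id = i.inventory_id
JOIN film f ON i.film_id = f.film_id
GROUP BY f.film_id, f.title, f.rental_rate
ORDER BY 1
""").fetchall()

for row in grouped_rows:
    print(row)
```

With these sample rows, each film has two rentals, so revenue is twice its rate per title.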

Notes:

  • having film_id in the GROUP BY clause is still good practice; in real life, two different movies might have the same title and rate, and you don't want to group them together

  • count(r.rental_id) can be simplified to count(*) (since that column obviously cannot be null)

  • table aliases make the queries easier to write and read

I suspect that you want:

SELECT f.title, COUNT(*) as total_rentals, f.rental_rate,  
       SUM(f.rental_rate) as revenue
FROM rental r JOIN
     inventory i
     ON r.inventory_id = i.inventory_id JOIN
     film f
     ON i.film_id = f.film_id
GROUP BY f.title, f.rental_rate
ORDER BY 1;

Notes:

  • In general, the GROUP BY columns should be the unaggregated columns in the SELECT. That has even been required (with the default settings) in MySQL for the past few years.
  • You can just sum the rental_rate column. There is no need to count and multiply.
  • Table aliases make the query easier to write and to read.
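The SUM simplification in the second note can be verified on a toy table of pre-joined rows (hypothetical data; SQLite used here as a convenient stand-in): summing the per-rental rate and multiplying the count by the rate give the same per-group revenue.

```python
import sqlite3

# One made-up row per rental, carrying the film's rate (as if already joined).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE rentals_joined (title TEXT, rental_rate REAL);
INSERT INTO rentals_joined VALUES
  ('ACADEMY DINOSAUR', 0.99), ('ACADEMY DINOSAUR', 0.99),
  ('ALIEN CENTER', 2.99), ('ALIEN CENTER', 2.99), ('ALIEN CENTER', 2.99);
""")

revenue_rows = con.execute("""
SELECT title,
       SUM(rental_rate)       AS revenue_sum,
       COUNT(*) * rental_rate AS revenue_mult
FROM rentals_joined
GROUP BY title, rental_rate
ORDER BY 1
""").fetchall()

for title, by_sum, by_mult in revenue_rows:
    # Both formulations agree because rental_rate is constant within each group.
    assert abs(by_sum - by_mult) < 1e-9
    print(title, by_sum)
```

The equivalence holds precisely because rental_rate is in the GROUP BY, so it is constant within each group.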

The first SQL works in MySQL because MySQL extends the SQL syntax to allow it. Spark SQL (in this case) is doing what just about every other database does.
