
Spark SQL version of the same query does not work whereas the normal SQL query does

The normal SQL query:

    SELECT DISTINCT(county_geoid), state_geoid, sum(PredResponse), sum(prop_count) FROM table_a GROUP BY county_geoid;

This gives me an output. However, the Spark SQL version of the same query, run from PySpark, does not work:

    result_county_performance_alpha = spark.sql("SELECT distinct(county_geoid), sum(PredResponse), sum(prop_count), state_geoid FROM table_a group by county_geoid")

This gives the following error:

    AnalysisException: u"expression 'tract_alpha.`state_geoid`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;

How can I resolve this issue?

Your "normal" query should not work anywhere. The correct way to write the query is:

    SELECT county_geoid, state_geoid, sum(PredResponse), sum(prop_count)
    FROM table_a
    GROUP BY county_geoid, state_geoid;

This should work on any database (where the columns and tables are defined and of the right types).
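In PySpark, the same fix applies directly. A minimal sketch, assuming table_a is registered as a temporary view as in your question (the column aliases total_pred_response and total_prop_count are just illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Same query, with state_geoid added to the GROUP BY so every
    # non-aggregated column in the SELECT is also a grouping column.
    result_county_performance_alpha = spark.sql("""
        SELECT county_geoid,
               state_geoid,
               sum(PredResponse) AS total_pred_response,
               sum(prop_count)   AS total_prop_count
        FROM table_a
        GROUP BY county_geoid, state_geoid
    """)
    result_county_performance_alpha.show()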

Your version has state_geoid in the SELECT, but it is not being aggregated and is not in the GROUP BY. That is not valid SQL. It might happen to work in MySQL, but that is due to a (mis)feature of that database (one that is finally being fixed). A sketch of the alternative the error message suggests follows below.
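If you genuinely want one row per county_geoid and do not care which state_geoid comes back (for example, because each county belongs to exactly one state), the error message itself points at the alternative: wrap the column in the first() aggregate. A sketch, under that assumption (the variable name result is illustrative):

    # Only safe if state_geoid is functionally determined by
    # county_geoid; first() returns an arbitrary value per group.
    result = spark.sql("""
        SELECT county_geoid,
               first(state_geoid) AS state_geoid,
               sum(PredResponse) AS total_pred_response,
               sum(prop_count)   AS total_prop_count
        FROM table_a
        GROUP BY county_geoid
    """)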

Also, you almost never want to use SELECT DISTINCT with GROUP BY. And the parentheses after DISTINCT make no difference: the construct is SELECT DISTINCT, and DISTINCT is not a function.
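To see that DISTINCT is not a function, note that the parentheses merely group the first column expression; both of the following are parsed identically, with DISTINCT applying to the whole select list:

    # Both statements are parsed the same way: DISTINCT applies to the
    # entire select list, not just county_geoid.
    a = spark.sql("SELECT DISTINCT(county_geoid), state_geoid FROM table_a")
    b = spark.sql("SELECT DISTINCT county_geoid, state_geoid FROM table_a")
    # a and b return identical results.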
