
Select the latest partition of data from Hive with Spark SQL

I have a Hive table partitioned by ds, where ds is a string like 2017-11-07. Since strings are comparable, I want to select the latest partition of data from Hive with Spark SQL, so I wrote this code:

Dataset<Row> ds = sparkSession.sql("select max(ds) from admin_zpw123_user_meta");

String s = ds.first().getString(0);

sparkSession.sql("select * from admin_zpw123_user_meta where ds="+s).show();

I can print the string s, which is 2017-11-07, but I don't get any output from the third statement. I want to know why, and whether there is a more elegant way to do this.

You need to put single quotes around the 2017-11-07 string when using it in the SQL statement. You can add them to the query like this:

sparkSession.sql("select * from admin_zpw123_user_meta where ds='" + s + "'").show();

I just added '' around 2017-11-07 and it works now, but it is still not elegant enough.

Actions are expensive in Spark because each one triggers a separate job, and you have an unnecessary one here:

String s = ds.first().getString(0);

To fix that, you can filter on the latest partition date in a single query:

sparkSession.sql("select * from admin_zpw123_user_meta where ds in (select max(distinct ds) from admin_zpw123_user_meta)").show();
