I am currently working on a POC with Spark SQL where I need to parallelize a JDBC read using a Spark SQL query:
JavaRDD<Row> dataset = sqlContext.read().jdbc(jdBcConnectionString, getSqlQuery(), "tran_id",
    lowerbound, upperbound, partitions, props).toJavaRDD();
Everything seems to work fine until you inspect the queries generated against the database (MS SQL Server in my case).
The lower bound query is
exec sp_executesql N'SELECT * FROM table_name WHERE tran_id < 770425 or post_tran_id is null'
while the upperbound query becomes
exec sp_executesql N'SELECT * FROM table_name WHERE tran_id >= 770425'
One would think that the essence of specifying bounds is to get all rows where column value is between the specified lowerbound and upperbound. But this appears not to be the case.
I am new to Spark. Is there another way to achieve this?
One would think that the essence of specifying bounds is to get all rows where column value is between the specified lowerbound and upperbound.
It is not, and the generated conditions are correct. As usual, it is better to read the documentation than to assume:
Notice that lowerBound and upperBound are just used to decide the partition stride, not for filtering the rows in table. So all rows in the table will be partitioned and returned. This option applies only to reading.
But it looks like lowerBound is equal to upperBound in your case.
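To make the "partition stride" wording concrete, here is a minimal sketch in plain Java of how per-partition WHERE clauses can be derived from the bounds. It is a simplified approximation, not Spark's actual code: the class name PartitionSketch and the method predicates are hypothetical, it assumes lowerBound < upperBound, and the exact details differ between Spark versions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: approximates how a partitioned JDBC read can turn
// (column, lowerBound, upperBound, numPartitions) into WHERE clauses.
public class PartitionSketch {

    static List<String> predicates(String column, long lowerBound,
                                   long upperBound, int numPartitions) {
        List<String> clauses = new ArrayList<>();
        if (numPartitions <= 1) {
            // A single partition reads the whole table without a filter.
            clauses.add("");
            return clauses;
        }
        // The bounds only determine the width of each partition's range.
        long stride = upperBound / numPartitions - lowerBound / numPartitions;
        long current = lowerBound;
        for (int i = 0; i < numPartitions; i++) {
            // The first partition has no lower limit, the last one no upper limit,
            // so rows outside [lowerBound, upperBound] are still read.
            String lower = (i == 0) ? null : column + " >= " + current;
            current += stride;
            String upper = (i == numPartitions - 1) ? null : column + " < " + current;
            if (lower == null) {
                clauses.add(upper + " or " + column + " is null");
            } else if (upper == null) {
                clauses.add(lower);
            } else {
                clauses.add(lower + " AND " + upper);
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        // Example bounds chosen for illustration only.
        for (String clause : predicates("tran_id", 0, 100, 4)) {
            System.out.println("WHERE " + clause);
        }
    }
}
```

Note that the first clause is open at the bottom and the last clause is open at the top, which is why every row of the table ends up in some partition regardless of the bounds. With two partitions and assumed bounds of 0 and 1540850, the sketch produces the predicates `tran_id < 770425 or tran_id is null` and `tran_id >= 770425`, the same shape as the two queries in the question.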
I am new to Spark. Is there another way to achieve this?
If you want to filter, apply where to the DataFrame (before converting it to an RDD):
dataset.where(col("tran_id").between(lowerBound, upperBound))
or pass a subquery as the table argument:
sqlContext.read().jdbc(
jdBcConnectionString,
"(SELECT * FROM table_name WHERE tran_id BETWEEN 0 AND 42) AS t", props);