
Spark SQL Generating Wrong Upper and Lower Bounds for JDBC Queries

So I am currently working on a POC with Spark SQL, where I need to parallelize the read operation using a Spark SQL query:

 JavaRDD<Row> dataset = sqlContext.read().jdbc(jdBcConnectionString, getSqlQuery(), "tran_id",
                lowerbound, upperbound, partitions, props).toJavaRDD();

Everything seems well and works fine until you inspect the generated queries (which in my case go to MS SQL Server).

The lower-bound query is

exec sp_executesql N'SELECT * FROM table_name WHERE tran_id < 770425 or post_tran_id is null'

while the upper-bound query becomes

exec sp_executesql N'SELECT * FROM table_name WHERE tran_id >= 770425'

One would think that the essence of specifying bounds is to get all rows where the column value is between the specified lower bound and upper bound, but this appears not to be the case.

I am new to Spark; is there another way to achieve this?

"One would think that the essence of specifying bounds is to get all rows where the column value is between the specified lower bound and upper bound."

It is not, and the conditions are correct. As usual, it is better to read the documentation than to assume:

Notice that lowerBound and upperBound are just used to decide the partition stride, not for filtering the rows in the table. So all rows in the table will be partitioned and returned. This option applies only to reading.
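To see why the two queries above come out the way they do, here is a minimal, self-contained sketch in plain Java of the stride logic the Spark JDBC reader applies when splitting a numeric column into partitions (simplified from Spark's internal `JDBCRelation`; the class and method names here are illustrative, not Spark's API):

```java
import java.util.ArrayList;
import java.util.List;

public class JdbcStrideSketch {

    // Simplified sketch of how Spark derives per-partition WHERE clauses
    // from (column, lowerBound, upperBound, numPartitions).
    // Illustrative only; the real logic lives inside Spark's JDBC data source.
    static List<String> partitionPredicates(String column, long lower, long upper,
                                            int numPartitions) {
        long stride = upper / numPartitions - lower / numPartitions;
        List<String> predicates = new ArrayList<>();
        long current = lower;
        for (int i = 0; i < numPartitions; i++) {
            String lowerClause = (i == 0) ? null : column + " >= " + current;
            current += stride;
            String upperClause = (i == numPartitions - 1) ? null : column + " < " + current;
            if (upperClause == null) {
                // Last partition is open above: everything from the boundary up.
                predicates.add(lowerClause);
            } else if (lowerClause == null) {
                // First partition is open below and also picks up NULL keys.
                predicates.add(upperClause + " or " + column + " is null");
            } else {
                predicates.add(lowerClause + " AND " + upperClause);
            }
        }
        return predicates;
    }

    public static void main(String[] args) {
        // With lowerBound == upperBound the stride is 0, so every boundary
        // collapses to the same value, reproducing the two queries in the question.
        for (String p : partitionPredicates("tran_id", 770425, 770425, 2)) {
            System.out.println(p);
        }
    }
}
```

Note that the bounds only position the partition boundaries; no partition ever excludes rows outside [lowerBound, upperBound], which is exactly why the first predicate is open below and the last is open above.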

But it looks like lowerBound is equal to upperBound in your case.

"I am new to Spark; is there another way to achieve this?"

If you want to filter, then apply where:

dataset.where(col("tran_id").between(lowerBound, upperBound))

or use a subquery as the table argument:

sqlContext.read().jdbc(
  jdBcConnectionString,
  "(SELECT * FROM table_name WHERE tran_id BETWEEN 0 AND 42) AS t", props);
