
Spark SQL Generating Wrong Upper and Lower Bounds for JDBC Queries

So I am currently working on a POC with Spark SQL, where I need to parallelize the read operation using a Spark SQL query:

 JavaRDD<Row> dataset = sqlContext.read().jdbc(jdBcConnectionString, getSqlQuery(), "tran_id",
                lowerbound, upperbound, partitions, props).toJavaRDD();

Everything seems well and works fine until you inspect the generated queries (which in my case go to MS SQL Server).

The lower-bound query is

exec sp_executesql N'SELECT * FROM table_name WHERE tran_id < 770425 or post_tran_id is null'

while the upper-bound query becomes

exec sp_executesql N'SELECT * FROM table_name WHERE tran_id >= 770425'

One would think that the essence of specifying bounds is to get all rows where the column value is between the specified lower bound and upper bound, but this appears not to be the case.

I am new to Spark; is there another way to achieve this?

"One would think that the essence of specifying bounds is to get all rows where the column value is between the specified lower bound and upper bound."

It is not, and the conditions are correct. As usual, it is better to read the documentation than to assume:

Notice that lowerBound and upperBound are just used to decide the partition stride, not for filtering the rows in the table. So all rows in the table will be partitioned and returned. This option applies only to reading.
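To see why the two queries above come out the way they do, here is a minimal, self-contained sketch in plain Java of the stride logic the Spark JDBC reader applies when splitting a numeric column into partitions (simplified from Spark's internal `JDBCRelation`; the class and method names here are illustrative, not Spark's API):

```java
import java.util.ArrayList;
import java.util.List;

public class JdbcStrideSketch {

    // Simplified sketch of how Spark derives per-partition WHERE clauses
    // from (column, lowerBound, upperBound, numPartitions).
    // Illustrative only; the real logic lives inside Spark's JDBC data source.
    static List<String> partitionPredicates(String column, long lower, long upper,
                                            int numPartitions) {
        long stride = upper / numPartitions - lower / numPartitions;
        List<String> predicates = new ArrayList<>();
        long current = lower;
        for (int i = 0; i < numPartitions; i++) {
            String lowerClause = (i == 0) ? null : column + " >= " + current;
            current += stride;
            String upperClause = (i == numPartitions - 1) ? null : column + " < " + current;
            if (upperClause == null) {
                // Last partition is open above: everything from the boundary up.
                predicates.add(lowerClause);
            } else if (lowerClause == null) {
                // First partition is open below and also picks up NULL keys.
                predicates.add(upperClause + " or " + column + " is null");
            } else {
                predicates.add(lowerClause + " AND " + upperClause);
            }
        }
        return predicates;
    }

    public static void main(String[] args) {
        // With lowerBound == upperBound the stride is 0, so every boundary
        // collapses to the same value, reproducing the two queries in the question.
        for (String p : partitionPredicates("tran_id", 770425, 770425, 2)) {
            System.out.println(p);
        }
    }
}
```

Note that the bounds only position the partition boundaries; no partition ever excludes rows outside [lowerBound, upperBound], which is exactly why the first predicate is open below and the last is open above.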

But it looks like lowerBound is equal to upperBound in your case.

"I am new to Spark; is there another way to achieve this?"

If you want to filter, then apply where:

dataset.where(col("tran_id").between(lowerBound, upperBound))

or use a subquery as the table argument:

sqlContext.read().jdbc(
  jdBcConnectionString,
  "(SELECT * FROM table_name WHERE tran_id BETWEEN 0 AND 42) AS t", props);
