
spark, scala & jdbc - how to limit number of records

Is there a way to limit the number of records fetched from the jdbc source using spark sql 2.2.0?

I am dealing with the task of moving (and transforming) a large number of records (>200M) from one MS Sql Server table to another:

val spark = SparkSession
    .builder()
    .appName("co.smith.copydata")
    .getOrCreate()

val sourceData = spark
    .read
    .format("jdbc")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("url", jdbcSqlConnStr)
    .option("dbtable", sourceTableName)
    .load()
    .take(limit)

While this works, it clearly first loads all 200M records from the database, taking its sweet 18 minutes, and only then returns the limited number of records I want for testing and development purposes.

Swapping take(...) and load() produces a compilation error.

I appreciate there are ways to copy sample data into a smaller table, or to use SSIS or other ETL tools.

I am really curious whether there is a way to achieve my goal using Spark, SQL and JDBC.

To limit the number of downloaded rows, a SQL query can be used instead of the table name in the "dbtable" option; see the description in the documentation.

In that query a "where" condition can be specified, for example using server-specific features to limit the number of rows (like "rownum" in Oracle, or "TOP" in SQL Server).
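As a sketch of this idea for the SQL Server source in the question: the "dbtable" option accepts a derived table of the form "(SELECT ...) alias", so the row cap can be built into it. The helper name limitedDbtable and the alias limited_src below are made up for illustration, not part of any API.

```scala
// Wrap the source table in a row-limited subquery so the database,
// not Spark, enforces the limit before any rows cross the wire.
// TOP is the SQL Server counterpart of Oracle's rownum.
def limitedDbtable(table: String, limit: Int): String =
  s"(SELECT TOP $limit * FROM $table) AS limited_src"

// Usage with the original read (other options unchanged):
//   val sourceData = spark.read
//     .format("jdbc")
//     .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
//     .option("url", jdbcSqlConnStr)
//     .option("dbtable", limitedDbtable(sourceTableName, limit))
//     .load()
```

With this, load() only ever sees the capped result set, so the 18-minute full-table fetch is avoided entirely.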

This approach is a little unkind to relational databases: Spark's load will request your full table, store it in memory/disk, and only then apply the RDD transformations and actions.

If you want to do exploratory work, I would suggest you persist this data on your first load. There are a few ways to do that. Take your code and do something like this:

val sourceData = spark
    .read
    .format("jdbc")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("url", jdbcSqlConnStr)
    .option("dbtable", sourceTableName)
    .load()
sourceData.write
    .option("header", "true")
    .option("delimiter", ",")
    .format("csv")
    .save("your_path")

This will let you save your data on your local machine as CSV, the most common format, which you can explore with any language. Every time you want to load it, read it from this file. If you want real-time analysis, or anything along those lines, I would suggest you build a pipeline that applies the transformations and updates another store; reloading your data from the database every time is not a good approach.

I have not tested this, but you should try using limit instead of take. take calls head under the covers, which has the following note:

this method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory.

whereas limit results in a LIMIT being pushed into the SQL query, since it is lazily evaluated:

The difference between this function and head is that head is an action and returns an array (by triggering query execution), while limit returns a new Dataset.
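Applied to the read from the question, the limit-based version would look like this (untested sketch; jdbcSqlConnStr, sourceTableName and limit are the values from the original code):

    val sourceData = spark
        .read
        .format("jdbc")
        .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
        .option("url", jdbcSqlConnStr)
        .option("dbtable", sourceTableName)
        .load()
        .limit(limit)  // a transformation: returns a Dataset, nothing is fetched yet

Because limit is lazy, sourceData here is still a Dataset you can transform and write out, rather than an Array[Row] materialized on the driver as take would produce.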

If you want the data without pulling it all in first, then you could even do something like:

...load.limit(limitNum).take(limitNum)
