
Apache Spark-SQL vs Sqoop benchmarking while transferring data from RDBMS to HDFS

I am working on a use case where I have to transfer data from an RDBMS to HDFS. We have benchmarked this case using Sqoop and found that we are able to transfer around 20 GB of data in 6-7 minutes.

Whereas when I try the same with Spark SQL, the performance is very low (1 GB of records takes 4 minutes to transfer from Netezza to HDFS). I am trying to do some tuning to increase its performance, but it seems unlikely that I can tune it to the level of Sqoop (around 3 GB of data per minute).

I agree that Spark is primarily a processing engine, but my main question is: both Spark and Sqoop use a JDBC driver internally, so why is there such a big difference in performance (or maybe I am missing something)? I am posting my code here.

import org.apache.spark.{SparkConf, SparkContext}
object helloWorld {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Netezza_Connection").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
    // Partitioned JDBC read: 14 partitions on column "id" over the bounds [0, 13]
    sqlContext.read.format("jdbc")
      .option("url", "jdbc:netezza://hostname:port/dbname")
      .option("dbtable", "POC_TEST")
      .option("user", "user").option("password", "password")
      .option("driver", "org.netezza.Driver")
      .option("numPartitions", "14").option("partitionColumn", "id")
      .option("lowerBound", "0").option("upperBound", "13")
      .option("fetchSize", "100000")
      .load().registerTempTable("POC")
    val df2 = sqlContext.sql("select * from POC")
    // Hash-partition rows by the value of column 1, then write text files to HDFS
    val partitioner = new org.apache.spark.HashPartitioner(14)
    val rdd = df2.rdd.map(x => (String.valueOf(x.get(1)), x)).partitionBy(partitioner).values
    rdd.saveAsTextFile("hdfs://Hostname/test")
  }
}

I have checked many other posts but could not find a clear answer on the internal working and tuning of Sqoop, nor did I find a Sqoop vs. Spark SQL benchmark. Kindly help me understand this issue.

You are using the wrong tools for the job.

Sqoop will launch a slew of processes (on the datanodes) that will each make a connection to your database (see num-mappers), and each will extract a part of the dataset. I don't think you can achieve that kind of read parallelism with Spark.

Get the dataset with Sqoop and then process it with Spark.

You can try the following:

  1. Read the data from Netezza without any partitions and with fetchSize increased to a million.

     sqlContext.read.format("jdbc").option("url","jdbc:netezza://hostname:port/dbname").option("dbtable","POC_TEST").option("user","user").option("password","password").option("driver","org.netezza.Driver").option("fetchSize","1000000").load().registerTempTable("POC")

  2. Repartition the data before writing it to the final file.

     val df3 = df2.repartition(10) //to reduce the shuffle

  3. ORC is a more optimized format than plain text. Write the final output to Parquet/ORC (a combined sketch of all three steps follows this list).

     df3.write.format("ORC").save("hdfs://Hostname/test")

@amitabh Although marked as an answer, I disagree with it.

Once you give the predicate to partition the data while reading from JDBC, Spark will run a separate task for each partition. In your case the number of tasks should be 14 (you can confirm this using the Spark UI).

I notice that you are using local as the master, which provides only 1 core for executors. Hence there will be no parallelism, which is what is happening in your case.

Now, to get the same throughput as Sqoop, you need to make sure that these tasks are running in parallel. Theoretically this can be done either by: 1. using 14 executors with 1 core each, or 2. using 1 executor with 14 cores (the other end of the spectrum).

Typically, I would go with 4-5 cores per executor. So I would test the performance with 15/5 = 3 executors (I added 1 to 14 to account for 1 core for the driver running in cluster mode). Use executor.cores and executor.instances in sparkConf.set to play with these configs.
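For illustration, here is a minimal sketch of that kind of configuration (the memory value is an assumption, and whether these settings take effect depends on your cluster manager):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Sketch only: request 3 executors with 5 cores each instead of running with
// setMaster("local"), so the 14 JDBC partitions can actually be read in parallel.
val conf = new SparkConf()
  .setAppName("Netezza_Connection")
  .set("spark.executor.instances", "3")   // ~14 read tasks / 5 cores per executor
  .set("spark.executor.cores", "5")
  .set("spark.executor.memory", "8g")     // revisit after checking the Spark UI
val sc = new SparkContext(conf)
val sqlContext = new HiveContext(sc)

When submitting with spark-submit, the same values can instead be passed via --num-executors, --executor-cores and --executor-memory.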

If this does not significantly increase performance, the next thing would be to look at the executor memory.

Finally, I would tweak the application logic to look at mapRDD sizes, partition sizes and shuffle sizes.

I had the same problem, because the piece of code you are using does not work for partitioning.

sqlContext.read.format("jdbc").option("url","jdbc:netezza://hostname:port/dbname").option("dbtable","POC_TEST").option("user","user").option("password","password").option("driver","org.netezza.Driver").option("numPartitions","14").option("lowerBound","0").option("upperBound","13").option("partitionColumn", "id").option("fetchSize","100000").load().registerTempTable("POC")

You can check the number of partitions created in your Spark job with:

df.rdd.partitions.length

You can use the following code to connect to the database:

sqlContext.read.jdbc(
    url = db_url,
    table = tableName,
    columnName = "ID",
    lowerBound = 1L,
    upperBound = 100000L,
    numPartitions = numPartitions,
    connectionProperties = connectionProperties)
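For completeness, a sketch of how the values referenced above might be defined (these are placeholders for illustration, not details from the original post):

import java.util.Properties

// Hypothetical connection settings; substitute your own values.
val db_url = "jdbc:netezza://hostname:port/dbname"
val tableName = "POC_TEST"
val numPartitions = 14

val connectionProperties = new Properties()
connectionProperties.setProperty("user", "user")
connectionProperties.setProperty("password", "password")
connectionProperties.setProperty("driver", "org.netezza.Driver")

Note that lowerBound and upperBound only control how the ID range is split across the partitions; they do not filter out rows outside that range.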

To optimize your Spark job, the following are the parameters to tune:

  1. number of partitions
  2. --num-executors
  3. --executor-cores
  4. --executor-memory
  5. --driver-memory
  6. fetch size

Options 2, 3, 4 and 5 depend on your cluster configuration; you can monitor your Spark job in the Spark UI.

The solution below helped me:

var df = spark.read.format("jdbc").option("url", "url").option("user", "user")
  .option("password", "password").option("dbtable", "dbTable").option("fetchSize", "10000").load()
df.registerTempTable("tempTable")
var dfRepart = spark.sql("select * from tempTable distribute by primary_key") // this will repartition the data evenly

dfRepart.write.format("parquet").save("hdfs_location")

Sqoop and Spark SQL both use JDBC connectivity to fetch data from RDBMS engines, but Sqoop has an edge here since it is specifically designed to migrate data between an RDBMS and HDFS.

Every single option available in Sqoop has been fine-tuned to get the best performance while doing data ingestion.

You can start with the -m option, which controls the number of mappers.

This is what you need to do to fetch data in parallel from the RDBMS. Can I do it in Spark SQL? Of course yes, but the developer would need to take care of the "multithreading" that Sqoop takes care of automatically.
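As an illustration of that manual split management (a sketch only, not from the original post; the column name and split boundaries are hypothetical), Spark's JDBC reader also accepts an explicit array of predicates, creating one partition per predicate:

import java.util.Properties

// Each predicate becomes one partition, i.e. one parallel JDBC query, so the
// developer chooses the split boundaries that Sqoop's -m option derives automatically.
val props = new Properties()
props.setProperty("user", "user")
props.setProperty("password", "password")
props.setProperty("driver", "org.netezza.Driver")

val predicates = Array(
  "id >= 0 and id < 1000000",
  "id >= 1000000 and id < 2000000",
  "id >= 2000000"                     // last split is open-ended
)

val df = sqlContext.read.jdbc(
  "jdbc:netezza://hostname:port/dbname",
  "POC_TEST",
  predicates,
  props)

println(df.rdd.partitions.length)     // one partition per predicate, i.e. 3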
