
Adding JDBC driver to Spark on EMR

I'm trying to add a JDBC driver to a Spark cluster that is executing on top of Amazon EMR, but I keep getting the following exception:

java.sql.SQLException: No suitable driver found

I tried the following things:

  1. Use addJar to add the driver JAR explicitly from the code.
  2. Using the spark.executor.extraClassPath and spark.driver.extraClassPath parameters.
  3. Using spark.driver.userClassPathFirst=true. With this option I got a different error because of a mix of dependencies with Spark; in any case, this option seems too aggressive if I just want to add a single JAR.

Could you please help me with this? How can I introduce the driver to the Spark cluster easily?

Thanks,

David

Source code of the application:

import java.util.Properties

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql.SQLContext

val properties = new Properties()
properties.put("ssl", "***")
properties.put("user", "***")
properties.put("password", "***")
properties.put("account", "***")
properties.put("db", "***")
properties.put("schema", "***")
properties.put("driver", "***")

val conf = new SparkConf().setAppName("***")
      .setMaster("yarn-cluster")
      .setJars(JavaSparkContext.jarOfClass(this.getClass()))

val sc = new SparkContext(conf)
sc.addJar(args(0))
val sqlContext = new SQLContext(sc)

val connectStr = "***" // JDBC connection URL
var df = sqlContext.read.jdbc(connectStr, "***", properties = properties)
df = df.select( Constants.***,
                Constants.***,
                Constants.***,
                Constants.***,
                Constants.***,
                Constants.***,
                Constants.***,
                Constants.***,
                Constants.***)
// Additional actions on df

I had the same problem. What ended up working for me is to use the --driver-class-path parameter with spark-submit.

The main thing is to add the entire Spark class path to --driver-class-path.

Here are my steps:

  1. I got the default driver class path by getting the value of the "spark.driver.extraClassPath" property from the Spark History Server, under "Environment".
  2. Copied the MySQL JAR file to each node in the EMR cluster.
  3. Put the MySQL JAR path at the front of the --driver-class-path argument to the spark-submit command and appended the value of "spark.driver.extraClassPath" to it.

My driver class path ended up looking like this:

--driver-class-path /home/hadoop/jars/mysql-connector-java-5.1.35.jar:/etc/hadoop/conf:/usr/lib/hadoop/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-mapreduce/*:/usr/lib/hadoop-yarn/*:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*
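
Putting it together, a full spark-submit call using that class path could look roughly like the sketch below; the deploy mode, main class, and application JAR are placeholders for your own job, and the tail of the default class path is elided:

spark-submit --deploy-mode cluster \
    --class com.example.MyApp \
    --driver-class-path "/home/hadoop/jars/mysql-connector-java-5.1.35.jar:/etc/hadoop/conf:/usr/lib/hadoop/*:..." \
    /home/hadoop/my-app.jar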

This worked with EMR 4.1 using Java with Spark 1.5.0. I had already added the MySQL JAR as a dependency in the Maven pom.xml.

You may also want to look at this answer, as it seems like a cleaner solution. I haven't tried it myself.

With EMR 5.2 I add any new JARs to the original driver classpath with:

export MY_DRIVER_CLASS_PATH=my_jdbc_jar.jar:some_other_jar.jar:$(grep spark.driver.extraClassPath /etc/spark/conf/spark-defaults.conf | awk '{print $2}')

and after that:

spark-submit --driver-class-path $MY_DRIVER_CLASS_PATH

Following a similar pattern to the answer quoted above, this is how I automated installing a JDBC driver on EMR clusters. (Full automation is useful for transient clusters that are started and terminated per job.)

  • Use a bootstrap action to install the JDBC driver on all EMR cluster nodes. Your bootstrap action will be a one-line shell script, stored in S3, that looks like:
aws s3 cp s3://.../your-jdbc-driver.jar /home/hadoop
  • Add a step to your EMR cluster, before running your actual Spark job, to modify /etc/spark/conf/spark-defaults.conf.

This will be another one-line shell script, stored in S3:

sudo sed -e 's,\(^spark.driver.extraClassPath.*$\),\1:/home/hadoop/your-jdbc-driver.jar,' -i /etc/spark/conf/spark-defaults.conf

The step itself will look like:

{
    "name": "add JDBC driver to classpath",
    "jar": "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar",
    "args": ["s3://...bucket.../set-spark-driver-classpath.sh"]
}

This will add your JDBC driver to spark.driver.extraClassPath.
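
If you script the whole flow, both pieces can be attached when the transient cluster is launched. Below is a rough sketch with the AWS CLI; the cluster name, release label, instance settings, and the bootstrap script name (install-jdbc-driver.sh) are assumptions, so substitute your own values, and your actual Spark job would go in as a further step:

aws emr create-cluster \
    --name transient-spark-job \
    --release-label emr-5.2.0 \
    --applications Name=Spark \
    --instance-type m4.large \
    --instance-count 3 \
    --use-default-roles \
    --auto-terminate \
    --bootstrap-actions Path=s3://...bucket.../install-jdbc-driver.sh \
    --steps 'Type=CUSTOM_JAR,Name=add-jdbc-driver-to-classpath,Jar=s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[s3://...bucket.../set-spark-driver-classpath.sh]'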

Explanation

  • You can't do both of these as bootstrap actions, because Spark won't be installed yet, so there is no config file to update.

  • You can't install the JDBC driver as a step, because you need the JDBC driver installed on the same path on all cluster nodes. In YARN cluster mode, the driver process does not necessarily run on the master node.

  • The configuration only needs to be updated on the master node, though, as the config is packed up and shipped to whatever node ends up running the driver.
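
To check that the step actually took effect, you can SSH to the master node and look at the modified file, for example:

grep spark.driver.extraClassPath /etc/spark/conf/spark-defaults.conf

The JDBC driver JAR should now appear at the end of the value.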

In case you're using Python in your EMR cluster, there's no need to specify the JAR while creating the cluster. You can add the package while creating your SparkSession.

    from pyspark.sql import SparkSession

    # spark.jars.packages takes a comma-separated list; setting it twice would
    # overwrite the first value, so list all packages in a single config call.
    spark = SparkSession \
        .builder \
        .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.0,mysql:mysql-connector-java:8.0.17") \
        .getOrCreate()

And then when you make your query, mention the driver like this:

    # dbtable below is a placeholder; .load() triggers the actual read
    form_df = spark.read.format("jdbc") \
        .option("url", "jdbc:mysql://yourdatabase") \
        .option("driver", "com.mysql.jdbc.Driver") \
        .option("dbtable", "your_table") \
        .load()

This way the package is included in the SparkSession, as it is pulled from a Maven repository. I hope it helps someone in the same situation I was once in.
