Adding JDBC driver to Spark on EMR
I'm trying to add a JDBC driver to a Spark cluster that is executing on top of Amazon EMR, but I keep getting a:
java.sql.SQLException: No suitable driver found
exception. I tried the following things:
Could you please help me with that? How can I introduce the driver to the Spark cluster easily?
Thanks,
David
Source code of the application
import java.util.Properties

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql.SQLContext

// JDBC connection properties (values elided)
val properties = new Properties()
properties.put("ssl", "***")
properties.put("user", "***")
properties.put("password", "***")
properties.put("account", "***")
properties.put("db", "***")
properties.put("schema", "***")
properties.put("driver", "***")

val conf = new SparkConf().setAppName("***")
  .setMaster("yarn-cluster")
  .setJars(JavaSparkContext.jarOfClass(this.getClass()))
val sc = new SparkContext(conf)
// Ships the jar to the executors, but does not put it on the driver's classpath
sc.addJar(args(0))

val sqlContext = new SQLContext(sc)
var df = sqlContext.read.jdbc(connectStr, "***", properties = properties)
df = df.select(Constants.***,
  Constants.***,
  Constants.***,
  Constants.***,
  Constants.***,
  Constants.***,
  Constants.***,
  Constants.***,
  Constants.***)
// Additional actions on df
I had the same problem. What ended up working for me is the --driver-class-path parameter used with spark-submit.
The main thing is to add the entire Spark class path to --driver-class-path.
Here are my steps:
My driver class path ended up looking like this:
--driver-class-path /home/hadoop/jars/mysql-connector-java-5.1.35.jar:/etc/hadoop/conf:/usr/lib/hadoop/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-mapreduce/*:/usr/lib/hadoop-yarn/*:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*
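Rather than hand-maintaining that long string, the classpath can be assembled programmatically before launching spark-submit. A minimal Python sketch, where the jar name, directory list, and application jar are illustrative placeholders rather than a definitive EMR layout:

```python
# Hypothetical inputs: your JDBC jar plus some of the EMR directories shown above.
jdbc_jar = "/home/hadoop/jars/mysql-connector-java-5.1.35.jar"
emr_entries = [
    "/etc/hadoop/conf",
    "/usr/lib/hadoop/*",
    "/usr/lib/hadoop-hdfs/*",
    "/usr/share/aws/emr/emrfs/conf",
    "/usr/share/aws/emr/emrfs/lib/*",
    "/usr/share/aws/emr/emrfs/auxlib/*",
]

# Java classpaths are colon-separated on Linux; the driver jar goes first.
driver_class_path = ":".join([jdbc_jar] + emr_entries)

argv = [
    "spark-submit",
    "--driver-class-path", driver_class_path,
    "my_app.jar",  # hypothetical application jar
]
# import subprocess; subprocess.run(argv, check=True)  # uncomment on a real cluster
print(driver_class_path)
```

This keeps the directory list in one place if you need the same classpath across several jobs.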
This worked with EMR 4.1 using Java with Spark 1.5.0. I had already added the MySQL JAR as a dependency in the Maven pom.xml.
You may also want to look at this answer, as it seems like a cleaner solution. I haven't tried it myself.
With EMR 5.2 I add any new jars to the original driver classpath with:
export MY_DRIVER_CLASS_PATH=my_jdbc_jar.jar:some_other_jar.jar:$(grep spark.driver.extraClassPath /etc/spark/conf/spark-defaults.conf | awk '{print $2}')
and after that:
spark-submit --driver-class-path $MY_DRIVER_CLASS_PATH
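The grep/awk pipeline above just pulls the second whitespace-separated field of the spark.driver.extraClassPath line out of spark-defaults.conf and prepends your jars to it. A rough Python equivalent, with the file contents shown inline purely for illustration:

```python
def prepend_to_extra_class_path(conf_text: str, jars: list) -> str:
    """Prepend jars to the spark.driver.extraClassPath value found in
    spark-defaults.conf text; mirrors the grep | awk pipeline above."""
    existing = ""
    for line in conf_text.splitlines():
        parts = line.split()
        # grep matches the key; awk '{print $2}' takes the value field
        if len(parts) >= 2 and parts[0] == "spark.driver.extraClassPath":
            existing = parts[1]
            break
    return ":".join(jars + ([existing] if existing else []))

# Illustrative spark-defaults.conf content, not copied from a real cluster
sample_conf = "spark.driver.extraClassPath /usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*"
result = prepend_to_extra_class_path(sample_conf, ["my_jdbc_jar.jar", "some_other_jar.jar"])
print(result)
```

Keeping the existing value at the end matters: dropping it would remove the EMR-provided jars (EMRFS, Hadoop) from the driver's classpath.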
Following a similar pattern to the answer quoted above, this is how I automated installing a JDBC driver on EMR clusters. (Full automation is useful for transient clusters started and terminated per job.)
First, a bootstrap action copies the driver from S3 onto every node:

aws s3 cp s3://.../your-jdbc-driver.jar /home/hadoop
Then, as an EMR step run after Spark is installed, append the driver to the spark.driver.extraClassPath entry in /etc/spark/conf/spark-defaults.conf.
This will be another one-line shell script, stored in S3:
sudo sed -e 's,\(^spark.driver.extraClassPath.*$\),\1:/home/hadoop/your-jdbc-driver.jar,' -i /etc/spark/conf/spark-defaults.conf
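The sed command captures the whole spark.driver.extraClassPath line and rewrites it with the jar path appended after a colon. A Python sketch of the same edit, applied to a string instead of the real file:

```python
import re

def append_jar_to_defaults(conf_text: str, jar_path: str) -> str:
    """Append jar_path to the spark.driver.extraClassPath line, as the
    sed one-liner above does to /etc/spark/conf/spark-defaults.conf."""
    return re.sub(
        r"^(spark\.driver\.extraClassPath.*)$",  # capture the whole line
        r"\1:" + jar_path,                        # re-emit it with the jar appended
        conf_text,
        flags=re.MULTILINE,
    )

# Illustrative file content, not taken from a real cluster
before = "spark.driver.extraClassPath /usr/lib/hadoop/*\nspark.executor.memory 4g"
after = append_jar_to_defaults(before, "/home/hadoop/your-jdbc-driver.jar")
print(after)
```

Note the edit is a no-op when the key is absent, which matches the sed behavior.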
The step itself will look like:
{
  "name": "add JDBC driver to classpath",
  "jar": "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar",
  "args": ["s3://...bucket.../set-spark-driver-classpath.sh"]
}
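If you submit steps with boto3 rather than raw JSON, the same step translates into an add_job_flow_steps call. A sketch, where the bucket name and cluster ID are placeholders and the actual API call is left commented out since it needs real AWS credentials:

```python
# Step definition in the shape boto3's EMR client expects
step = {
    "Name": "add JDBC driver to classpath",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar",
        "Args": ["s3://my-bucket/set-spark-driver-classpath.sh"],  # placeholder bucket
    },
}

# import boto3
# emr = boto3.client("emr", region_name="us-east-1")
# emr.add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXXX", Steps=[step])
print(step["Name"])
```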
This will add your JDBC driver to spark.driver.extraClassPath.
Explanation
You can't do both as bootstrap actions, because Spark won't be installed yet, so there is no config file to update.
You can't install the JDBC driver as a step, because you need the JDBC driver installed on the same path on all cluster nodes. In YARN cluster mode, the driver process does not necessarily run on the master node.
The configuration only needs to be updated on the master node, though, as the config is packed up and shipped to whichever node ends up running the driver.
In case you're using Python in your EMR cluster, there's no need to specify the jar while creating the cluster. You can add the jar package while creating your SparkSession.
from pyspark.sql import SparkSession

# Note: calling .config("spark.jars.packages", ...) twice overwrites the first
# value, so pass both coordinates as one comma-separated list.
spark = SparkSession \
    .builder \
    .config("spark.jars.packages",
            "org.apache.hadoop:hadoop-aws:2.7.0,mysql:mysql-connector-java:8.0.17") \
    .getOrCreate()
Then, when you make your query, mention the driver like this:
form_df = spark.read.format("jdbc"). \
    option("url", "jdbc:mysql://yourdatabase"). \
    option("driver", "com.mysql.jdbc.Driver"). \
    load()
This way the package is included in the SparkSession as it is pulled from a Maven repository. I hope it helps someone who is in the same situation I once was.