
How do I spark-submit remotely to EMR in client mode?

I have an ECS task configured to run spark-submit against an EMR cluster. The spark-submit is configured in YARN cluster mode.

My streaming application is supposed to save an RDD to Redshift, but I'm getting this error:

java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
    at com.databricks.spark.redshift.Utils$.assertThatFileSystemIsNotS3BlockFileSystem(Utils.scala:162)
    at com.databricks.spark.redshift.RedshiftWriter.saveToRedshift(RedshiftWriter.scala:386)
    at com.databricks.spark.redshift.DefaultSource.createRelation(DefaultSource.scala:108)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
    ...

I suspect that because "spark.yarn.jars" was not set, Spark uploaded my remote server's $SPARK_HOME libraries to the cluster, and those are missing the jars that contain com.amazon.ws.emr.hadoop.fs.EmrFileSystem.
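
For context, the client-mode submission from the remote host looks roughly like this (the class name, jar, and config path are placeholders, not my exact values):

    # client-mode submit from the remote host; class and jar are placeholders
    export HADOOP_CONF_DIR=/etc/hadoop/conf   # yarn-site.xml/core-site.xml copied from the EMR master
    spark-submit \
      --master yarn \
      --deploy-mode client \
      --class com.example.MyStreamingApp \
      my-streaming-app.jar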

So I also tried setting "spark.yarn.jars=hdfs://nodename:8020/user/spark/jars/*.jar" after copying the EMR master node's /usr/lib/spark/jars/* over to HDFS. Then it fails with:

java.io.InvalidClassException: org.apache.spark.sql.execution.SparkPlan; local class incompatible: stream classdesc serialVersionUID = -7931627949087445875, local class serialVersionUID = -5425351703039338847

I think there may be a mismatch between the remote client's jars and the EMR cluster's jars, even though both report version 2.4.7.
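
One way to check for such a mismatch is to compare checksums of the jars on both sides. Note that EMR ships its own Spark build (the jar versions typically carry an amzn suffix), so an Apache 2.4.7 download is not necessarily binary-identical to EMR's 2.4.7. A quick check, assuming shell access to both hosts:

    # on the EMR master node
    cd /usr/lib/spark/jars && md5sum *.jar | sort > /tmp/emr-jars.md5
    # on the remote client
    cd $SPARK_HOME/jars && md5sum *.jar | sort > /tmp/client-jars.md5
    # copy one list next to the other, then diff; differing lines point at mismatched builds
    diff /tmp/emr-jars.md5 /tmp/client-jars.md5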

Does anyone have a solution for getting my streaming spark-submit job to run against EMR in YARN client mode?

The binaries on the client need to be the same as those on the EMR cluster.

This resource helped me resolve this issue: https://docs.dominodatalab.com/en/4.5.2/reference/spark/external_spark/Connecting_to_an_Amazon_EMR_cluster_from_Domino.html
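
Following that guide, the gist is to copy the Spark and Hadoop binaries and the cluster's client configuration from the EMR master node onto the remote host, so the driver runs exactly the same builds as the cluster. A rough sketch, where the host name, SSH key, and paths are illustrative and reflect a typical EMR layout (verify them on your cluster):

    # run on the remote client (e.g. in the ECS task's image build)
    MASTER=hadoop@<emr-master-private-dns>          # placeholder host
    KEY="ssh -i emr-key.pem"                        # placeholder key
    rsync -a -e "$KEY" $MASTER:/usr/lib/spark/            /usr/lib/spark/
    rsync -a -e "$KEY" $MASTER:/usr/lib/hadoop/           /usr/lib/hadoop/
    rsync -a -e "$KEY" $MASTER:/etc/spark/conf/           /etc/spark/conf/
    rsync -a -e "$KEY" $MASTER:/etc/hadoop/conf/          /etc/hadoop/conf/
    # EMRFS jars that provide com.amazon.ws.emr.hadoop.fs.EmrFileSystem
    rsync -a -e "$KEY" $MASTER:/usr/share/aws/emr/emrfs/  /usr/share/aws/emr/emrfs/

    export SPARK_HOME=/usr/lib/spark
    export HADOOP_CONF_DIR=/etc/hadoop/conf

With the cluster's own builds on the client, the serialVersionUID mismatch should go away, and the copied Hadoop configuration is what points s3:// paths at EmrFileSystem.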
