I have an ECS task configured to run spark-submit against an EMR cluster. The spark-submit is configured in YARN cluster mode.
My streaming application is supposed to save RDD data to Redshift, but I'm getting this error:
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at com.databricks.spark.redshift.Utils$.assertThatFileSystemIsNotS3BlockFileSystem(Utils.scala:162)
at com.databricks.spark.redshift.RedshiftWriter.saveToRedshift(RedshiftWriter.scala:386)
at com.databricks.spark.redshift.DefaultSource.createRelation(DefaultSource.scala:108)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
...
I suspect that, because "spark.yarn.jars" was not set, spark-submit pushed the libraries from my remote server's $SPARK_HOME to the cluster, and those are missing the jar that provides com.amazon.ws.emr.hadoop.fs.EmrFileSystem.
So I copied the contents of /usr/lib/spark/jars/ from the EMR master node into HDFS and set "spark.yarn.jars=hdfs://nodename:8020/user/spark/jars/*.jar".
With that setting, the job fails with a different error:
java.io.InvalidClassException: org.apache.spark.sql.execution.SparkPlan; local class incompatible: stream classdesc serialVersionUID = -7931627949087445875, local class serialVersionUID = -5425351703039338847
I think there may be a mismatch between the jars on the remote client and the jars on the EMR cluster, even though both are Spark version 2.4.7.
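For reference, the "spark.yarn.jars" attempt described above corresponds roughly to a submit command like the one assembled below. This is only a sketch: the application jar name and main class are hypothetical placeholders, and only the --master, --deploy-mode, --class, and --conf options and the HDFS path come from standard spark-submit usage and the question itself.

```python
import shlex


def build_submit_command(app_jar, main_class, yarn_jars=None, deploy_mode="cluster"):
    """Assemble a spark-submit argument list for a YARN-managed cluster."""
    cmd = [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", deploy_mode,
        "--class", main_class,
    ]
    if yarn_jars:
        # Point YARN containers at jars already staged in HDFS instead of
        # uploading the client's local $SPARK_HOME/jars on every submit.
        cmd += ["--conf", f"spark.yarn.jars={yarn_jars}"]
    cmd.append(app_jar)
    return cmd


cmd = build_submit_command(
    app_jar="my-streaming-app.jar",          # hypothetical application jar
    main_class="com.example.StreamingJob",   # hypothetical main class
    yarn_jars="hdfs://nodename:8020/user/spark/jars/*.jar",  # path from the question
)
# Print a shell-quoted version of the command for inspection.
print(shlex.join(cmd))
```

Note that this setting only changes which jars the YARN containers load. The client that runs spark-submit still uses its own local Spark installation, which is where a class mismatch between the two sides can come from.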
Does anyone have a solution for getting my streaming spark-submit job working on EMR in yarn client mode?
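The Spark binaries on the client that runs spark-submit need to be the same as those on the EMR cluster. A matching version number is not enough: the serialVersionUID mismatch shows that the two sides loaded different builds of the same class.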
This resource helped me resolve this issue: https://docs.dominodatalab.com/en/4.5.2/reference/spark/external_spark/Connecting_to_an_Amazon_EMR_cluster_from_Domino.html
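One way to check whether the client's jars really match the cluster's is to compare the two jar directories by file name and checksum. The sketch below assumes you have copied the EMR master node's /usr/lib/spark/jars/ to a local directory next to the client's own $SPARK_HOME/jars; the function names and the example paths in the comment are illustrative, not part of any Spark or EMR tooling.

```python
import hashlib
from pathlib import Path


def jar_digests(jar_dir):
    """Map each jar file name in jar_dir to the SHA-256 digest of its contents."""
    return {
        path.name: hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(Path(jar_dir).glob("*.jar"))
    }


def compare_jar_dirs(client_dir, cluster_dir):
    """Report jars that are missing on one side or differ in content."""
    client = jar_digests(client_dir)
    cluster = jar_digests(cluster_dir)
    shared = client.keys() & cluster.keys()
    return {
        "only_on_client": sorted(client.keys() - cluster.keys()),
        "only_on_cluster": sorted(cluster.keys() - client.keys()),
        "different_content": sorted(n for n in shared if client[n] != cluster[n]),
    }


# Example call (both paths are assumptions about where the jars live):
# report = compare_jar_dirs("/opt/spark/jars", "/tmp/emr-spark-jars")
# for category, names in report.items():
#     print(category, names)
```

Any entry under "only_on_cluster" or "different_content" points to a jar where the client and the cluster disagree, which is the kind of difference that can produce the ClassNotFoundException or InvalidClassException shown in the question.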