
Read HDFS file in Spark Application in Kerberized Cluster

I set up a Hadoop cluster with Hortonworks Data Platform 2.5, which also includes Ambari 2.4, Kerberos, Spark 1.6.2 and HDFS.

I have, among others, the Kerberos principals and keytabs for the following users:

  • spark (created by Ambari when Kerberos was enabled)
  • hdfsuserA (created via kadmin -> add_principal; see the sketch just below)
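
For reference, the hdfsuserA principal and keytab were created roughly like this (a minimal sketch of the kadmin steps; the realm EXAMPLE.COM is just a placeholder, only the keytab path matches my cluster):

# run as an admin on the KDC host
sudo kadmin.local
addprinc -randkey hdfsuserA@EXAMPLE.COM
xst -k /etc/security/keytabs/hdfsuserA.keytab hdfsuserA@EXAMPLE.COM
quit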

User spark is needed to run the spark-submit command in the secured cluster, and the Spark application must open some files in the HDFS directory /user/hdfsuserA/... , which is owned by hdfsuserA (mode 700).
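
The ownership and permissions can be confirmed with the standard HDFS client (a quick check, assuming a valid Kerberos ticket from kinit and the hdfs client on the submit host):

# show the directory entry itself, then its contents
hdfs dfs -ls -d /user/hdfsuserA
hdfs dfs -ls /user/hdfsuserA/new/data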

Since I enabled Kerberos, my Spark application no longer runs; it fails with the following exception:

[Stage 1:>     (0 + 92) / 162]Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 55 in stage 1.0 failed 4 times, most recent failure: Lost task 55.3 in stage 1.0 (TID 225, had-data1): org.apache.hadoop.security.AccessControlException: Permission denied: user=spark, access=EXECUTE, inode="/user/hdfsuserA/new/data/Export_PDM_Hadoop_05_2016.csv":hdfsuserA:hadoop:drwx------
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259)
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:205)
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1827)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1811)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPathAccess(FSDirectory.java:1785)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1862)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1831)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1744)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:693)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:373)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2313)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2309)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2307)

The issue is that I authenticate as user spark to be able to start the Spark application, but inside the app I get an exception because the HDFS directory /user/hdfsuserA is not accessible to the spark user.

When I run the spark-submit command as user hdfsuserA instead, I get:

[hdfsuserA@had-job ~]$ kinit -kt /etc/security/keytabs/hdfsuserA.keytab hdfsuserA

[hdfsuserA@had-job ~]$ spark-submit --class spark.sales.TestAnalysis --master yarn --deploy-mode client /home/hdfsuserA/application_new.jar hdfs://had-job:8020/user/hdfsuserA/new/data/*
16/12/03 09:44:46 INFO Remoting: Starting remoting
16/12/03 09:44:46 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@141.79.71.34:46996]
spark.yarn.driver.memoryOverhead is set but does not apply in client mode.
spark.driver.cores is set but does not apply in client mode.
16/12/03 09:44:49 INFO metastore: Trying to connect to metastore with URI thrift://had-job:9083
16/12/03 09:44:49 INFO metastore: Connected to metastore.
Exception in thread "main" org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
        at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:122)
        at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:62)
        at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:530)
        at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:59)
        at myutil.SparkContextFactory.createSparkContext(SparkContextFactory.java:34)
        at spark.sales.BasketBasedSalesAnalysis.main(BasketBasedSalesAnalysis.java:46)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

What is the correct solution for such an issue? Can I, for example, kinit as another user inside the app?

I found the problem: it was a user issue! Because I had only created the hdfsuserA account on the NameNode host of my cluster, from which I run the spark-submit command, the application was not able to authenticate as this user via keytabs on the other hosts.
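
A quick way to spot this kind of mismatch is to check whether the OS account exists on every node (a sketch, assuming SSH access from the submit host; had-job and had-data1 are just the hostnames that appear in the logs above):

# hypothetical host list; replace with the actual cluster hostnames
for host in had-job had-data1; do
    echo "== $host =="
    ssh "$host" id hdfsuserA
done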

So to solve this issue, add the same user on all hosts of the cluster:

sudo useradd hdfsuserA
sudo passwd hdfsuserA
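
If there are many nodes, the same commands can be applied from the submit host in a loop (a sketch, assuming SSH access and passwordless sudo on each node; hosts.txt is a hypothetical file with one cluster hostname per line):

# using the same UID everywhere (e.g. useradd -u <uid>) keeps the account consistent across hosts
while read -r host; do
    ssh "$host" sudo useradd hdfsuserA
done < hosts.txt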

Running the Spark application should work afterwards (with the --master yarn parameter in spark-submit; with --master local[x] it always worked)!
