We had been using a kerberized Hadoop environment (HDP 3.1.4 with Spark 2.3.2 and Ambari 2.7.4) for a long time without any problems.
Since enabling NameNode high availability, we have the following issue: when we want to read data using Spark SQL, we first have to write some (other) data. If we don't write something before the read operation, it fails.
Here is our scenario:
$ kinit -kt /etc/security/keytabs/user.keytab user
$ spark-shell
scala> spark.sql("SELECT * FROM pm.simulation_uci_hydraulic_sensor").show
Hive Session ID = cbb6b6e2-a048-41e0-8e77-c2b2a7f52dbe
[Stage 0:> (0 + 1) / 1]20/04/22 15:04:53 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, had-data6.my-company.de, executor 2): java.io.IOException: DestHost:destPort had-job.my-company.de:8020 , LocalHost:localPort had-data6.my-company.de/192.168.178.123:0. Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:806)
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1502)
at org.apache.hadoop.ipc.Client.call(Client.java:1444)
at org.apache.hadoop.ipc.Client.call(Client.java:1354)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$Proxy13.getBlockLocations(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:317)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
at com.sun.proxy.$Proxy14.getBlockLocations(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:862)
at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:851)
at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:840)
at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1004)
at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:320)
at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:316)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:328)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:899)
at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:522)
at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:364)
at org.apache.orc.OrcFile.createReader(OrcFile.java:251)
[...]
scala> val primitiveDS = Seq(1, 2, 3).toDS()
primitiveDS: org.apache.spark.sql.Dataset[Int] = [value: int]
scala> primitiveDS.write.saveAsTable("pm.todelete3")
20/04/22 15:05:07 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
scala> spark.sql("SELECT * FROM pm.simulation_uci_hydraulic_sensor").show
+--------+--------+--------------------+------+
|instance|sensorId| ts| value|
+--------+--------+--------------------+------+
| 21| PS6|2020-04-18 17:19:...| 8.799|
| 21| EPS1|2020-04-18 17:19:...|2515.6|
| 21| PS3|2020-04-18 17:19:...| 2.187|
+--------+--------+--------------------+------+
When running a new spark-shell session, we see the same behavior!
Can someone help with this issue? Thank you!
We found the answer to the problem: the properties of tables created before High Availability was activated in the Hadoop cluster still pointed to the "old" NameNode location.
You can find the table information by running the following command:
$ spark-shell
scala> spark.sql("DESCRIBE EXTENDED db.table").show(false)
This shows the table information, which in my case looked like this:
+----------------------------+---------------------------------------------------------------------------------------------+-------+
|col_name |data_type |comment|
+----------------------------+---------------------------------------------------------------------------------------------+-------+
|instance |int |null |
|sensorId |string |null |
|ts |timestamp |null |
|value |double |null |
| | | |
|# Detailed Table Information| | |
|Database |simulation | |
|Table                       |uci_hydraulic_sensor_1                                                                       |       |
|Created By |Spark 2.3.2.3.1.4.0-315 | |
|Type |EXTERNAL | |
|Provider |parquet | |
|Statistics |244762020 bytes | |
|Location                    |hdfs://had-job.mycompany.de:8020/projects/pm/simulation/uci_hydraulic_sensor_1               |       | <== This is important!
|Serde Library |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe | |
|InputFormat |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat | |
|OutputFormat |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat | |
+----------------------------+---------------------------------------------------------------------------------------------+-------+
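If you have many tables, checking each `DESCRIBE EXTENDED` output by hand gets tedious. As a minimal, hedged sketch (the `oldAuthority` value below is illustrative, taken from my output; adapt it to your cluster), the check itself is just a comparison of the URI authority against the old hard-coded NameNode host:port:

```scala
// Hypothetical helper: returns true if a table location still points at a
// hard-coded NameNode host:port instead of the HA nameservice.
object LocationCheck {
  def usesOldNameNode(location: String, oldAuthority: String): Boolean =
    new java.net.URI(location).getAuthority == oldAuthority
}

// Example usage inside spark-shell:
//   LocationCheck.usesOldNameNode(
//     "hdfs://had-job.mycompany.de:8020/projects/pm/simulation/uci_hydraulic_sensor_1",
//     "had-job.mycompany.de:8020")   // true => table needs migration
```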
To set the new table location using the HA cluster service name, run the following SQL:
$ spark-shell
scala> spark.sql("ALTER TABLE simulation.uci_hydraulic_sensor_1 SET LOCATION 'hdfs://my-ha-name/projects/pm/simulation/uci_hydraulic_sensor_1'")
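If several tables are affected, you can generate the `ALTER TABLE` statements instead of writing them by hand. This is a sketch under stated assumptions: `oldAuthority` and `haService` are placeholders for your old NameNode address and your HA nameservice name, and the helper only rewrites the URI authority, leaving the path untouched:

```scala
// Sketch: rewrite a location that points at the old NameNode so it uses the
// HA nameservice, and build the ALTER TABLE statement to apply it.
object HaMigration {
  def rewriteLocation(location: String, oldAuthority: String, haService: String): String = {
    val uri = new java.net.URI(location)
    if (uri.getAuthority == oldAuthority)
      // Keep scheme and path, swap only the authority (host:port -> nameservice)
      new java.net.URI(uri.getScheme, haService, uri.getPath, null, null).toString
    else location  // already migrated (or external) -> leave unchanged
  }

  def alterStatement(table: String, newLocation: String): String =
    s"ALTER TABLE $table SET LOCATION '$newLocation'"
}
```

In spark-shell you could then feed each generated statement to `spark.sql(...)` for the tables flagged as stale.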
In subsequent Spark sessions, reading the table works fine!