Spark in a kerberized Hadoop environment with High Availability enabled: Spark SQL can only read data after a write task
We had been using a kerberized Hadoop environment (HDP 3.1.4 with Spark 2.3.2 and Ambari 2.7.4) for a long time, and everything went well so far.
Now we enabled NameNode high availability and have the following issue: when we want to read data using Spark SQL, we first have to write some (other) data. If we don't write something before the read operation, it fails.
Here is our scenario:
$ kinit -kt /etc/security/keytabs/user.keytab user
$ spark-shell
scala> spark.sql("SELECT * FROM pm.simulation_uci_hydraulic_sensor").show
Hive Session ID = cbb6b6e2-a048-41e0-8e77-c2b2a7f52dbe
[Stage 0:> (0 + 1) / 1]20/04/22 15:04:53 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, had-data6.my-company.de, executor 2): java.io.IOException: DestHost:destPort had-job.my-company.de:8020 , LocalHost:localPort had-data6.my-company.de/192.168.178.123:0. Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:806)
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1502)
at org.apache.hadoop.ipc.Client.call(Client.java:1444)
at org.apache.hadoop.ipc.Client.call(Client.java:1354)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$Proxy13.getBlockLocations(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:317)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
at com.sun.proxy.$Proxy14.getBlockLocations(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:862)
at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:851)
at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:840)
at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1004)
at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:320)
at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:316)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:328)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:899)
at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:522)
at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:364)
at org.apache.orc.OrcFile.createReader(OrcFile.java:251)
[...]
scala> val primitiveDS = Seq(1, 2, 3).toDS()
primitiveDS: org.apache.spark.sql.Dataset[Int] = [value: int]
scala> primitiveDS.write.saveAsTable("pm.todelete3")
20/04/22 15:05:07 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
scala> spark.sql("SELECT * FROM pm.simulation_uci_hydraulic_sensor").show
+--------+--------+--------------------+------+
|instance|sensorId| ts| value|
+--------+--------+--------------------+------+
| 21| PS6|2020-04-18 17:19:...| 8.799|
| 21| EPS1|2020-04-18 17:19:...|2515.6|
| 21| PS3|2020-04-18 17:19:...| 2.187|
+--------+--------+--------------------+------+
When running a new spark-shell session: same behavior!
Can someone help with this issue? Thank you!
We found the answer to the problem: the table properties pointed to the "old" NameNode location, because the table was created before High Availability was activated in the Hadoop cluster.
You can find the table information by running the following command:
$ spark-shell
scala> spark.sql("DESCRIBE EXTENDED db.table").show(false)
This shows the table information, like in my case:
+----------------------------+---------------------------------------------------------------------------------------------+-------+
|col_name |data_type |comment|
+----------------------------+---------------------------------------------------------------------------------------------+-------+
|instance |int |null |
|sensorId |string |null |
|ts |timestamp |null |
|value |double |null |
| | | |
|# Detailed Table Information| | |
|Database |simulation | |
|Table                       |uci_hydraulic_sensor_1                                                                       |       |
|Created By |Spark 2.3.2.3.1.4.0-315 | |
|Type |EXTERNAL | |
|Provider |parquet | |
|Statistics |244762020 bytes | |
|Location |hdfs://had-job.mycompany.de:8020/projects/pm/simulation/uci_hydraulic_sensor_1 <== This is important!
|Serde Library |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe | |
|InputFormat |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat | |
|OutputFormat |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat | |
+----------------------------+---------------------------------------------------------------------------------------------+-------+
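The `Location` row is the culprit: the URI still carries the old active NameNode's host and port (`had-job.mycompany.de:8020`) instead of the HA nameservice. A minimal sketch of the rewrite (in Python for illustration; the host name and the nameservice ID `my-ha-name` are just the values from this example, not anything Spark provides):

```python
from urllib.parse import urlsplit, urlunsplit

OLD_NAMENODE = "had-job.mycompany.de:8020"  # pre-HA NameNode authority (from the DESCRIBE output)
HA_NAMESERVICE = "my-ha-name"               # logical nameservice ID (dfs.nameservices)

def rewrite_location(location: str) -> str:
    """Replace a hard-coded NameNode authority with the HA nameservice.

    Locations that already use the nameservice (or a non-HDFS scheme)
    are returned unchanged.
    """
    parts = urlsplit(location)
    if parts.scheme == "hdfs" and parts.netloc == OLD_NAMENODE:
        parts = parts._replace(netloc=HA_NAMESERVICE)
    return urlunsplit(parts)

old = "hdfs://had-job.mycompany.de:8020/projects/pm/simulation/uci_hydraulic_sensor_1"
print(rewrite_location(old))
# hdfs://my-ha-name/projects/pm/simulation/uci_hydraulic_sensor_1
```

With HA enabled, the HDFS client resolves the nameservice to whichever NameNode is currently active, so the rewritten URI keeps working across failovers.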
To set the new table location using the HA cluster service name, run the following SQL:
$ spark-shell
scala> spark.sql("ALTER TABLE simulation.uci_hydraulic_sensor_1 SET LOCATION 'hdfs://my-ha-name/projects/pm/simulation/uci_hydraulic_sensor_1'")
In subsequent Spark sessions the table read works fine!
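If many tables were created before HA was enabled, each one needs the same `ALTER TABLE` fix. A hedged sketch (Python; the table/location pairs below are hypothetical — in a real session you would collect them from `DESCRIBE EXTENDED` output or the Hive metastore) that generates the statements:

```python
OLD_PREFIX = "hdfs://had-job.mycompany.de:8020"  # pre-HA NameNode URI prefix
NEW_PREFIX = "hdfs://my-ha-name"                 # HA nameservice URI prefix

def alter_statements(tables):
    """Yield ALTER TABLE ... SET LOCATION statements for stale locations.

    `tables` is an iterable of (qualified_table_name, location) pairs;
    tables whose location already uses the nameservice are skipped.
    """
    for name, location in tables:
        if location.startswith(OLD_PREFIX):
            new_location = NEW_PREFIX + location[len(OLD_PREFIX):]
            yield f"ALTER TABLE {name} SET LOCATION '{new_location}'"

# Hypothetical metastore listing:
tables = [
    ("simulation.uci_hydraulic_sensor_1",
     "hdfs://had-job.mycompany.de:8020/projects/pm/simulation/uci_hydraulic_sensor_1"),
    ("simulation.already_fixed",
     "hdfs://my-ha-name/projects/pm/simulation/already_fixed"),
]
for stmt in alter_statements(tables):
    print(stmt)
```

Each generated statement could then be executed in the shell with `spark.sql(stmt)`; only tables still pointing at the old NameNode are touched.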