Spark in a kerberized Hadoop environment with High Availability enabled: Spark SQL can only read data after a write task
We had been using a kerberized Hadoop environment (HDP 3.1.4 with Spark 2.3.2 and Ambari 2.7.4) for a long time, and everything went well so far.
Now we enabled NameNode high availability and have the following issue: when we want to read data using Spark SQL, we first have to write some (other) data. If we don't write something before the read operation, it fails.
Here is our scenario:
$ kinit -kt /etc/security/keytabs/user.keytab user
$ spark-shell
scala> spark.sql("SELECT * FROM pm.simulation_uci_hydraulic_sensor").show
Hive Session ID = cbb6b6e2-a048-41e0-8e77-c2b2a7f52dbe
[Stage 0:> (0 + 1) / 1]20/04/22 15:04:53 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, had-data6.my-company.de, executor 2): java.io.IOException: DestHost:destPort had-job.my-company.de:8020 , LocalHost:localPort had-data6.my-company.de/192.168.178.123:0. Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:806)
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1502)
at org.apache.hadoop.ipc.Client.call(Client.java:1444)
at org.apache.hadoop.ipc.Client.call(Client.java:1354)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$Proxy13.getBlockLocations(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:317)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
at com.sun.proxy.$Proxy14.getBlockLocations(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:862)
at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:851)
at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:840)
at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1004)
at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:320)
at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:316)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:328)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:899)
at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:522)
at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:364)
at org.apache.orc.OrcFile.createReader(OrcFile.java:251)
[...]
scala> val primitiveDS = Seq(1, 2, 3).toDS()
primitiveDS: org.apache.spark.sql.Dataset[Int] = [value: int]
scala> primitiveDS.write.saveAsTable("pm.todelete3")
20/04/22 15:05:07 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
scala> spark.sql("SELECT * FROM pm.simulation_uci_hydraulic_sensor").show
+--------+--------+--------------------+------+
|instance|sensorId| ts| value|
+--------+--------+--------------------+------+
| 21| PS6|2020-04-18 17:19:...| 8.799|
| 21| EPS1|2020-04-18 17:19:...|2515.6|
| 21| PS3|2020-04-18 17:19:...| 2.187|
+--------+--------+--------------------+------+
When running a new spark-shell session: same behavior!
Can someone help with this issue? Thank you!
We found the answer to the problem: the table properties pointed to the "old" NameNode location, because the table was created before High Availability was activated in the Hadoop cluster.
You can find the table information by running the following command:
$ spark-shell
scala> spark.sql("DESCRIBE EXTENDED db.table").show(false)
This shows the table information, like in my case:
+----------------------------+---------------------------------------------------------------------------------------------+-------+
|col_name |data_type |comment|
+----------------------------+---------------------------------------------------------------------------------------------+-------+
|instance |int |null |
|sensorId |string |null |
|ts |timestamp |null |
|value |double |null |
| | | |
|# Detailed Table Information| | |
|Database |simulation | |
|Table                       |uci_hydraulic_sensor_1                                                                       |       |
|Created By |Spark 2.3.2.3.1.4.0-315 | |
|Type |EXTERNAL | |
|Provider |parquet | |
|Statistics |244762020 bytes | |
|Location |hdfs://had-job.mycompany.de:8020/projects/pm/simulation/uci_hydraulic_sensor_1 <== This is important!
|Serde Library |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe | |
|InputFormat |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat | |
|OutputFormat |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat | |
+----------------------------+---------------------------------------------------------------------------------------------+-------+
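The `Location` row is the culprit: the URI still carries the old active NameNode's host and port (`had-job.mycompany.de:8020`) instead of the HA nameservice. A minimal sketch of the rewrite (in Python for illustration; the host name and the nameservice ID `my-ha-name` are just the values from this example, not anything Spark provides):

```python
from urllib.parse import urlsplit, urlunsplit

OLD_NAMENODE = "had-job.mycompany.de:8020"  # pre-HA NameNode authority (from the DESCRIBE output)
HA_NAMESERVICE = "my-ha-name"               # logical nameservice ID (dfs.nameservices)

def rewrite_location(location: str) -> str:
    """Replace a hard-coded NameNode authority with the HA nameservice.

    Locations that already use the nameservice (or a non-HDFS scheme)
    are returned unchanged.
    """
    parts = urlsplit(location)
    if parts.scheme == "hdfs" and parts.netloc == OLD_NAMENODE:
        parts = parts._replace(netloc=HA_NAMESERVICE)
    return urlunsplit(parts)

old = "hdfs://had-job.mycompany.de:8020/projects/pm/simulation/uci_hydraulic_sensor_1"
print(rewrite_location(old))
# hdfs://my-ha-name/projects/pm/simulation/uci_hydraulic_sensor_1
```

With HA enabled, the HDFS client resolves the nameservice to whichever NameNode is currently active, so the rewritten URI keeps working across failovers.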
To set the new table location using the HA cluster service name, run the following SQL:
$ spark-shell
scala> spark.sql("ALTER TABLE simulation.uci_hydraulic_sensor_1 SET LOCATION 'hdfs://my-ha-name/projects/pm/simulation/uci_hydraulic_sensor_1'")
In subsequent Spark sessions the table read works fine!
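If many tables were created before HA was enabled, each one needs the same `ALTER TABLE` fix. A hedged sketch (Python; the table/location pairs below are hypothetical — in a real session you would collect them from `DESCRIBE EXTENDED` output or the Hive metastore) that generates the statements:

```python
OLD_PREFIX = "hdfs://had-job.mycompany.de:8020"  # pre-HA NameNode URI prefix
NEW_PREFIX = "hdfs://my-ha-name"                 # HA nameservice URI prefix

def alter_statements(tables):
    """Yield ALTER TABLE ... SET LOCATION statements for stale locations.

    `tables` is an iterable of (qualified_table_name, location) pairs;
    tables whose location already uses the nameservice are skipped.
    """
    for name, location in tables:
        if location.startswith(OLD_PREFIX):
            new_location = NEW_PREFIX + location[len(OLD_PREFIX):]
            yield f"ALTER TABLE {name} SET LOCATION '{new_location}'"

# Hypothetical metastore listing:
tables = [
    ("simulation.uci_hydraulic_sensor_1",
     "hdfs://had-job.mycompany.de:8020/projects/pm/simulation/uci_hydraulic_sensor_1"),
    ("simulation.already_fixed",
     "hdfs://my-ha-name/projects/pm/simulation/already_fixed"),
]
for stmt in alter_statements(tables):
    print(stmt)
```

Each generated statement could then be executed in the shell with `spark.sql(stmt)`; only tables still pointing at the old NameNode are touched.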