
pyspark hiveContext error while executing spark-submit application to yarn and remote CDH kerberized env

The error occurs while executing

airflow@41166b660d82:~$ spark-submit --master yarn --deploy-mode cluster --keytab keytab_name.keytab --principal keytab_name@REALM --jars /path/to/spark-hive_2.11-2.3.0.jar sranje.py

from an airflow docker container that is not in the CDH env (not managed by CDH CM). sranje.py is a simple select * from a hive table.

The app is accepted on CDH yarn and executed twice, failing with this error:

...
2020-12-31 10:11:43 INFO  StateStoreCoordinatorRef:54 - Registered StateStoreCoordinator endpoint
Traceback (most recent call last):
  File "sranje.py", line 21, in <module>
    source_df = hiveContext.table(hive_source).na.fill("")
  File "/dfs/dn4/yarn/nm/usercache/etladmin/appcache/application_1608187067076_0150/container_e29_1608187067076_0150_02_000001/pyspark.zip/pyspark/sql/context.py", line 366, in table
  File "/dfs/dn4/yarn/nm/usercache/etladmin/appcache/application_1608187067076_0150/container_e29_1608187067076_0150_02_000001/pyspark.zip/pyspark/sql/session.py", line 721, in table
  File "/dfs/dn4/yarn/nm/usercache/etladmin/appcache/application_1608187067076_0150/container_e29_1608187067076_0150_02_000001/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
  File "/dfs/dn4/yarn/nm/usercache/etladmin/appcache/application_1608187067076_0150/container_e29_1608187067076_0150_02_000001/pyspark.zip/pyspark/sql/utils.py", line 79, in deco
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog':"
2020-12-31 10:11:43 ERROR ApplicationMaster:70 - User application exited with status 1
2020-12-31 10:11:43 INFO  ApplicationMaster:54 - Final app status: FAILED, exitCode: 1, (reason: User application exited with status 1)
...

We assume that some jars and java dependencies are missing. Any ideas? (A small debugging sketch follows the details below.)

Details

  1. there is a valid krb ticket before executing the spark cmd
  2. if we omit --jars /path/to/spark-hive_2.11-2.3.0.jar, the python error is different:
...
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':"
...
  3. versions of spark (2.3.0), hadoop (2.6.0) and java are the same as on CDH
  4. hive-site.xml, yarn-site.xml etc. are also provided and valid
  5. this same spark-submit app executes OK from a node inside the CDH cluster
  6. we tried adding additional --jars spark-hive_2.11-2.3.0.jar,spark-core_2.11-2.3.0.jar,spark-sql_2.11-2.3.0.jar,hive-hcatalog-core-2.3.0.jar,spark-hive-thriftserver_2.11-2.3.0.jar
  7. developers use this code as an example:
# -*- coding: utf-8 -*-
import sys
reload(sys)                       # Python 2 only: allow setting a UTF-8 default encoding
sys.setdefaultencoding('utf-8')
from pyspark.context import SparkContext
from pyspark.sql import SparkSession, SQLContext, HiveContext, functions as F
from pyspark.sql.utils import AnalysisException
from datetime import datetime

# Spark 2.x: SQLContext and HiveContext are kept for compatibility and wrap a SparkSession
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)
sqlContext = SQLContext(sc)
hiveContext = HiveContext(sc)

current_date = str(datetime.now().strftime('%Y-%m-%d'))

# read the Hive table and fill nulls with empty strings (the statement that fails in the traceback above)
hive_source = "lnz_ch.lnz_cfg_codebook"
source_df = hiveContext.table(hive_source).na.fill("")

print("Number of records: {}".format(source_df.count()))
print("First 20 rows of the table:")
source_df.show(20)
  8. a different script, same error:
# -*- coding: utf-8 -*-
import sys
reload(sys)                       # Python 2 only: allow setting a UTF-8 default encoding
sys.setdefaultencoding('utf-8')
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.sql import SparkSession

if __name__ == "__main__":
    # Spark 2.x style: a single SparkSession with Hive support enabled
    spark = SparkSession.builder.appName("ZekoTest").enableHiveSupport().getOrCreate()
    data = spark.sql("SELECT * FROM lnz_ch.lnz_cfg_codebook")
    data.show(20)
    spark.stop()                  # SparkSession exposes stop(), not close()
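
To narrow down the missing-dependency suspicion, a small debugging sketch (not part of the original scripts; it uses PySpark's internal _jsc gateway, so treat it as debug-only) can print which catalog implementation the session is configured with and which jars were actually shipped with the application:

# -*- coding: utf-8 -*-
# debugging sketch: submit it with the same spark-submit command as sranje.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# "hive" means the session is configured for HiveExternalCatalog,
# "in-memory" means Hive support was not picked up from the config at all
print(spark.conf.get("spark.sql.catalogImplementation", "in-memory"))

# jars registered with the SparkContext (internal py4j gateway, debug only)
print(spark.sparkContext._jsc.sc().listJars())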

Thank you.

Hive dependencies were resolved with the following (a quick smoke test follows these steps):

  • downloading hive.tar.gz with the exact version of CDH Hive
  • creating symlinks from hive/ to spark/: ln -s apache-hive-1.1.0-bin/lib/*.jar spark-2.3.0-bin-without-hadoop/jars/
  • downloading additional jars from the maven repo to spark/jars/:
hive-hcatalog-core-2.3.0.jar
slf4j-api-1.7.26.jar
spark-hive_2.11-2.3.0.jar
spark-hive-thriftserver_2.11-2.3.0.jar
  • refreshing the env vars:
HADOOP_CLASSPATH=$(find $HADOOP_HOME -name '*.jar' | xargs echo | tr ' ' ':')
SPARK_DIST_CLASSPATH=$(hadoop classpath)
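
As a quick smoke test of this setup (a sketch under the assumption that the jars and XML files above are in place, not something taken from the original post), forcing a metastore call before touching any table shows whether the Hive client can be instantiated at all:

# -*- coding: utf-8 -*-
# smoke-test sketch: listing databases goes through the Hive metastore client,
# so it fails fast if the client jars are inconsistent
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

for db in spark.catalog.listDatabases():
    print(db.name)

spark.stop()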

beeline works, but pyspark throws an error:

2021-01-07 15:02:20 INFO  StateStoreCoordinatorRef:54 - Registered StateStoreCoordinator endpoint
Traceback (most recent call last):
  File "sranje.py", line 21, in <module>
    source_df = hiveContext.table(hive_source).na.fill("")
  File "/dfs/dn12/yarn/nm/usercache/etladmin/appcache/application_1608187067076_0207/container_e29_1608187067076_0207_01_000001/pyspark.zip/pyspark/sql/context.py", line 366, in table
  File "/dfs/dn12/yarn/nm/usercache/etladmin/appcache/application_1608187067076_0207/container_e29_1608187067076_0207_01_000001/pyspark.zip/pyspark/sql/session.py", line 721, in table
  File "/dfs/dn12/yarn/nm/usercache/etladmin/appcache/application_1608187067076_0207/container_e29_1608187067076_0207_01_000001/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
  File "/dfs/dn12/yarn/nm/usercache/etladmin/appcache/application_1608187067076_0207/container_e29_1608187067076_0207_01_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/dfs/dn12/yarn/nm/usercache/etladmin/appcache/application_1608187067076_0207/container_e29_1608187067076_0207_01_000001/py4j-0.10.6-src.zip/py4j/protocol.py", line 320, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o31.table.
: java.lang.NoSuchFieldError: METASTORE_CLIENT_SOCKET_LIFETIME
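
For reference, a NoSuchFieldError on a HiveConf constant usually points at mixed Hive client versions on the classpath (the symlinked CDH 1.1.0 jars sitting next to the 2.3.0-era jars). One hedged direction, not verified against this cluster, is to leave Spark's built-in Hive classes alone and point only the metastore client at the CDH version through Spark's standard configs; the jar path below is a placeholder:

# hedged sketch, assuming CDH Hive 1.1.0; the jar path is a placeholder
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("metastore-client-sketch")
         # Hive metastore version the client should speak
         .config("spark.sql.hive.metastore.version", "1.1.0")
         # classpath holding the matching Hive client jars
         .config("spark.sql.hive.metastore.jars", "/path/to/apache-hive-1.1.0-bin/lib/*")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("SHOW DATABASES").show()

In practice these options would normally be passed as --conf arguments to spark-submit so they apply before the session is created.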

But that's another question. Thank you all.
