
Attempting to write a CSV file in SFTP mode with Spark on YARN in a Kerberos environment

I'm trying to write a DataFrame to a CSV file and put that CSV file on a remote machine. The Spark job runs on YARN in a Kerberos-secured cluster.

Below is the error I get when the job tries to write the CSV file to the remote machine:

diagnostics: User class threw exception: org.apache.hadoop.security.AccessControlException: Permission denied: user=dev, access=WRITE, inode="/data/9/yarn/local/usercache/dev/appcache/application_1532962490515_15862/container_e05_1532962490515_15862_02_000001/tmp/spark_sftp_connection_temp178/_temporary/0":hdfs:hdfs:drwxr-xr-x

To write this CSV file, I'm using the following parameters in a method that writes the file in SFTP mode:

def writeToSFTP(df: DataFrame, path: String): Unit = {
    // Write the DataFrame as a CSV file and upload it to the remote host over SFTP
    df.write
      .format("com.springml.spark.sftp")
      .option("host", "hostname.test.fr")
      .option("username", "test_hostname")
      .option("password", "toto")
      .option("fileType", "csv")
      .option("delimiter", ",")
      .save(path)
  }

I'm using the Spark SFTP connector library, as described at https://github.com/springml/spark-sftp
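For context, the connector is normally pulled in as a build dependency. A minimal sbt sketch, assuming the Scala 2.11 artifact the project publishes (check the repository for the coordinates matching your Spark and Scala versions):

// build.sbt (sketch): artifact name and version are assumptions; adjust to your setup
libraryDependencies += "com.springml" % "spark-sftp_2.11" % "1.1.3"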

The script used to launch the job is:

#!/bin/bash

kinit -kt /home/spark/dev.keytab dev@CLUSTER.HELP.FR

spark-submit --class fr.edf.dsp.launcher.LauncherInsertion \
--master yarn-cluster \
--num-executors 1 \
--driver-memory 5g \
--executor-memory 5g \
--queue dev \
--files /home/spark/dev.keytab#user.keytab,\
/etc/krb5.conf#krb5.conf,\
/home/spark/jar/dev-application-SNAPSHOT.conf#app.conf \
--conf "spark.executor.extraJavaOptions=-Dapp.config.path=./app.conf -Djava.security.auth.login.config=./jaas.conf" \
--conf "spark.driver.extraJavaOptions=-Dapp.config.path=./app.conf -Djava.security.auth.login.config=./jaas.conf" \
/home/spark/jar/dev-SNAPSHOT.jar > /home/spark/out.log 2>&1&

The CSV files are never written to HDFS. Once the DataFrame is built, I try to send it to the remote machine. I suspect a Kerberos issue with the SFTP Spark connector: YARN can't reach the remote machine...

Any help is welcome, thanks.

Add a temporary location where you have write access, and don't worry about cleaning it up: once the SFTP transfer is done, those files are deleted. See the corrected method below:

def writeToSFTP(df: DataFrame, path: String): Unit = {
    df.write
      .format("com.springml.spark.sftp")
      .option("host", "hostname.test.fr")
      .option("username", "test_hostname")
      .option("password", "toto")
      .option("fileType", "csv")
      // Stage the intermediate file under a location the job's user can write to
      .option("hdfsTempLocation", "/user/currentuser/")
      .option("delimiter", ",")
      .save(path)
  }
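A minimal usage sketch of the method above (the sample data and remote path are placeholders, and an active SparkSession named spark is assumed to be in scope):

import spark.implicits._  // assumes an active SparkSession named `spark`

val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")  // illustrative data
writeToSFTP(df, "/remote/data/export.csv")             // placeholder remote path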
