
NameError: name 'SparkSession' is not defined

I'm new to Cask CDAP and the Hadoop environment.

I'm creating a pipeline and I want to use a PySpark program. I have the full script of the Spark program, and it works when I test it from the command line, but it fails when I copy-paste it into a CDAP pipeline.

It gives me an error in the logs:

NameError: name 'SparkSession' is not defined

My script starts this way:

from pyspark.sql import *

spark = SparkSession.builder.getOrCreate()
from pyspark.sql.functions import trim, to_date, year, month
sc= SparkContext()

How can I fix it?

Spark connects to the locally running Spark cluster through a SparkContext. A better explanation can be found here: https://stackoverflow.com/a/24996767/5671433.

To initialise a SparkSession, a SparkContext has to be initialized first. One way to do that is to write a function that initializes all your contexts and a Spark session.

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, SparkSession


def init_spark(app_name, master_config):
    """
    :params app_name: Name of the app
    :params master_config: eg. local[4]
    :returns SparkContext, SQLContext, SparkSession:
    """
    conf = SparkConf().setAppName(app_name).setMaster(master_config)

    sc = SparkContext(conf=conf)   # connects to the cluster given by master_config
    sc.setLogLevel("ERROR")
    sql_ctx = SQLContext(sc)       # legacy SQL entry point built on the context
    spark = SparkSession(sc)       # the session the original script was missing

    return (sc, sql_ctx, spark)

This can then be called as:

sc, sql_ctx, spark = init_spark("App_name", "local[4]")
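Once the session exists, the SparkSession-dependent code from the original script can run against it. A minimal sketch, assuming a hypothetical CSV file people.csv with a name column (the file and column names are illustrative, not from the question):

from pyspark.sql.functions import trim

# Read a (hypothetical) CSV into a DataFrame using the session created above
df = spark.read.csv("people.csv", header=True, inferSchema=True)
df = df.withColumn("name", trim(df["name"]))   # same kind of transformation as in the question
df.show(5)

sc.stop()  # release cluster resources when the job is done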
