
SparkSession initialization error - Unable to use spark.read

I tried to create a standalone PySpark program that reads a csv and stores it in a Hive table. I have trouble configuring the Spark session, conf and context objects. Here is my code:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, SparkSession
from pyspark.sql.types import *

conf = SparkConf().setAppName("test_import")
sc = SparkContext(conf=conf)
sqlContext  = SQLContext(sc)

spark = SparkSession.builder.config(conf=conf)
dfRaw = spark.read.csv("hdfs:/user/..../test.csv",header=False)

dfRaw.createOrReplaceTempView('tempTable')
sqlContext.sql("create table customer.temp as select * from tempTable")

And I get the error:

dfRaw = spark.read.csv("hdfs:/user/../test.csv",header=False)
AttributeError: 'Builder' object has no attribute 'read'

What is the right way to configure the Spark session object in order to use the read.csv command? Also, can someone explain the difference between the Session, Context and conf objects?

There is no need to use both SparkContext and SparkSession to initialize Spark; SparkSession is the newer, recommended entry point.

To initialize your environment, simply do:

spark = SparkSession\
  .builder\
  .appName("test_import")\
  .getOrCreate()

You can run SQL commands by doing:

spark.sql(...)

Prior to Spark 2.0.0, three separate objects were used: SparkContext, SQLContext and HiveContext. These were used separately depending on what you wanted to do and the data types involved.

With the introduction of the Dataset/DataFrame abstractions, the SparkSession object became the main entry point to the Spark environment. It is still possible to access the other objects by first initializing a SparkSession (say, in a variable named spark) and then using spark.sparkContext / spark.sqlContext.
