
Convert csv file to dataframe in Spark 1.5.2 without databricks

I am trying to convert a csv file to a dataframe in Spark 1.5.2 with Scala, without using the databricks library, as this is a community project and that library is not available. My approach was the following:

var inputPath  = "input.csv"
var text = sc.textFile(inputPath)
var rows = text.map(line => line.split(",").map(_.trim))
var header = rows.first()
var data = rows.filter(_(0) != header(0))
var df = sc.makeRDD(1 to data.count().toInt)
  .map(i => (data.take(i).drop(i-1)(0)(0),
             data.take(i).drop(i-1)(0)(1),
             data.take(i).drop(i-1)(0)(2),
             data.take(i).drop(i-1)(0)(3),
             data.take(i).drop(i-1)(0)(4)))
  .toDF(header(0), header(1), header(2), header(3), header(4))

This code, even though it is quite a mess, runs without returning any error messages. The problem comes when trying to display the data inside df in order to verify the correctness of this method and later run some queries on df. The error code I get after executing df.show() is SPARK-5063 (a short sketch of the likely cause follows the questions below). My questions are:

1) Why is it not possible to print the content of df?

2) Is there any other more straightforward method to convert a csv to a dataframe in Spark 1.5.2 without using the databricks library?
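Regarding question 1: SPARK-5063 is the error Spark raises when RDD transformations or actions are invoked from inside another transformation, which only the driver is allowed to do. In the snippet above, data.take(i) is called inside the map over sc.makeRDD(1 to data.count().toInt), so the RDD data is referenced on the executors, which is exactly what SPARK-5063 forbids. A minimal sketch of a workaround, assuming the data and header values from the question, is to build the tuples directly from data without any nested RDD calls:

// Sketch of a workaround (assumes sc, data and header from the question above).
// Instead of calling data.take(i) inside a map over another RDD (which triggers
// SPARK-5063), map each parsed row of data directly to a tuple and call toDF.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

val df = data
  .map(r => (r(0), r(1), r(2), r(3), r(4)))   // one tuple per CSV row (5 String columns)
  .toDF(header(0), header(1), header(2), header(3), header(4))

df.show()   // no nested RDD access, so show() works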

For Spark 1.5.x, the code snippet below can be used to convert the input into a DataFrame:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

// Define the schema using a case class.
// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
// you can use custom classes that implement the Product interface.
case class DataClass(id: Int, name: String, surname: String, bdate: String, address: String)

// Create an RDD of DataClass objects and register it as a table.
val peopleData = sc.textFile("input.csv").map(_.split(",")).map(p => DataClass(p(0).trim.toInt, p(1).trim, p(2).trim, p(3).trim, p(4).trim)).toDF()
peopleData.registerTempTable("dataTable")

val peopleDataFrame = sqlContext.sql("SELECT * from dataTable")

peopleDataFrame.show()
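One caveat with the snippet above, assuming input.csv has a header line as in the question: the integer parse on the first column (p(0).trim.toInt) would fail on the header row, so it should be filtered out first. A minimal sketch, reusing the DataClass case class and the sqlContext implicits defined above:

// Sketch: skip the header line before mapping to the case class
// (assumes the DataClass case class and import sqlContext.implicits._ above).
val raw = sc.textFile("input.csv")
val headerLine = raw.first()

val peopleData = raw
  .filter(_ != headerLine)                       // drop the header line
  .map(_.split(",").map(_.trim))
  .map(p => DataClass(p(0).toInt, p(1), p(2), p(3), p(4)))
  .toDF()

peopleData.show()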

Spark 1.5

You can create it like this (note that this example uses the SparkSession API, which was introduced in Spark 2.0):

SparkSession spark = SparkSession
    .builder()
    .appName("RDDtoDF_Updated")
    .master("local[2]")
    .config("spark.some.config.option", "some-value")
    .getOrCreate();

StructType schema = DataTypes.createStructType(new StructField[] {
    DataTypes.createStructField("eid", DataTypes.IntegerType, false),
    DataTypes.createStructField("eName", DataTypes.StringType, false),
    DataTypes.createStructField("eAge", DataTypes.IntegerType, true),
    DataTypes.createStructField("eDept", DataTypes.IntegerType, true),
    DataTypes.createStructField("eSal", DataTypes.IntegerType, true),
    DataTypes.createStructField("eGen", DataTypes.StringType, true) });

String filepath = "F:/Hadoop/Data/EMPData.txt";
JavaRDD<Row> empRDD = spark.read()
    .textFile(filepath)
    .javaRDD()
    .map(line -> line.split("\\,"))
    .map(r -> RowFactory.create(Integer.parseInt(r[0]), r[1].trim(), Integer.parseInt(r[2]),
        Integer.parseInt(r[3]), Integer.parseInt(r[4]), r[5].trim()));

Dataset<Row> empDF = spark.createDataFrame(empRDD, schema);
empDF.groupBy("eDept").max("eSal").show();

Using Spark with Scala.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.hive.HiveContext

val hiveCtx = new HiveContext(sc)
val inputPath = "input.csv"
val text = sc.textFile(inputPath)

// Split each line into trimmed fields and take the first line as the header
val fields = text.map(line => line.split(",").map(_.trim))
val header = fields.first()

// Drop the header line and convert the remaining arrays to Rows
val rows = fields.filter(_(0) != header(0)).map(a => Row.fromSeq(a))

// Build a schema with all columns as nullable strings
val schema = StructType(header.map(fieldName => StructField(fieldName, StringType, true)))

val df = hiveCtx.createDataFrame(rows, schema)

This should work.

But for creating a DataFrame, I would recommend using Spark-CSV.
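For reference, if the spark-csv package were available on the cluster, the usual Spark 1.5.x usage would look roughly like this (a sketch only, since the question states the package cannot be used):

// Sketch: reading a CSV with spark-csv (requires the com.databricks:spark-csv
// artifact on the classpath, which the question says is not available here).
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")        // first line contains the column names
  .option("inferSchema", "true")   // infer column types instead of all strings
  .load("input.csv")

df.show()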

