How to create a DataFrame without using a case class?

Question

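I want to create a DataFrame from a text file.

Case classes are limited to 22 fields; I have more than 100 fields.

So I am running into a problem when creating the case class.

My actual goal is to create a DataFrame.

Is there any other way to create a DataFrame without using a case class?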

Answer 1

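One way is to use the spark-csv package to read the file directly and create a DataFrame. The package infers the schema from the header if the file has one, or you can define a custom schema using StructType.

In the example below, I create a custom schema.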

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType}

val sqlContext = new SQLContext(sc)
val customSchema = StructType(Array(
    StructField("year", IntegerType, true),
    StructField("make", StringType, true),
    StructField("model", StringType, true),
    StructField("comment", StringType, true),
    StructField("blank", StringType, true)))

// Option 1: apply the custom schema
val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") // Use first line of all files as header
    .schema(customSchema)
    .load("cars.csv")

// Option 2: let the package infer the column types
val dfInferred = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") // Use first line of all files as header
    .option("inferSchema", "true") // Automatically infer data types
    .load("cars.csv")

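You can find the various other options on the Databricks spark-csv documentation page.

Another option:

You can create the schema using StructType as shown above, and then use sqlContext's createDataFrame to create the DataFrame.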

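val vRdd = sc.textFile(..filelocation..)
// createDataFrame expects an RDD[Row] whose values match the schema types; "," is an assumed delimiter
val rowRdd = vRdd.map(line => Row.fromSeq(line.split(",").toSeq))
val df = sqlContext.createDataFrame(rowRdd, schema)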

Answer 2

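From the Spark documentation:

When case classes cannot be defined ahead of time (for example, the structure of records is encoded in a string, or a text dataset will be parsed and fields will be projected differently for different users), a DataFrame can be created programmatically in three steps.

1. Create an RDD of Rows from the original RDD.
2. Create the schema, represented by a StructType, matching the structure of the Rows in the RDD created in step 1.
3. Apply the schema to the RDD of Rows via the createDataFrame method provided by SQLContext.

Another approach is to define the StructType with an explicit data type in each StructField, which lets you use several data types. See the example below for both implementations; please also read the commented-out code, which shows the first one.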

package com.spark.examples

import org.apache.spark._
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql._
import org.apache.spark._
import org.apache.spark.sql.DataFrame
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import org.apache.spark.sql.types._

// Import Row.
import org.apache.spark.sql.Row;
// Import Spark SQL data types
import org.apache.spark.sql.types.{ StructType, StructField, StringType }

object MultipleDataTypeSchema extends Serializable {

  val conf = new SparkConf().setAppName("schema definition")

  conf.set("spark.executor.memory", "100M")
  conf.setMaster("local")

  val sc = new SparkContext(conf)
  // Create the SQLContext from the SparkContext above.
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  def main(args: Array[String]): Unit = {

    // Create an RDD
    val people = sc.textFile("C:/Users/User1/Documents/test")

    /* First implementation: the schema is encoded in a string; split it, then map each name to a field.
     * All columns will be of string type.

    // Generate the schema based on the schema string
    val schemaString = "name address age" // Here you could read the column names from a properties file too.
    val schema =
      StructType(
        schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)));*/

    // Second implementation: define the data type of each field explicitly

    val schema =
      StructType(
        StructField("name", StringType, true) ::
          StructField("address", StringType, true) ::
          StructField("age", StringType, false) :: Nil)

    // Convert records of the RDD (people) to Rows.
    val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim, p(2).trim))
    // Apply the schema to the RDD.
    val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)
    peopleDataFrame.printSchema()

    sc.stop()

  }
}

Its output:

17/01/03 14:24:13 INFO SparkContext: Created broadcast 0 from textFile at MultipleDataTypeSchema.scala:30
root
 |-- name: string (nullable = true)
 |-- address: string (nullable = true)
 |-- age: string (nullable = false)
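
The schema printed above still declares every column as a string. As a minimal sketch of what a mixed-type schema might look like, assuming age should be an integer: the names typedSchema, sampleLines and typedRows are hypothetical, and the sample lines only stand in for the contents of the text file.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Schema with mixed column types: age is declared as an integer here (an assumption).
val typedSchema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("address", StringType, nullable = true),
  StructField("age", IntegerType, nullable = false)))

// Hypothetical sample lines standing in for the lines of the input file.
val sampleLines = Seq("Alice, 1 Main St, 30", "Bob, 2 Oak Ave, 41")

// Each value in a Row has to match the declared column type, so age is parsed with toInt.
val typedRows = sampleLines
  .map(_.split(","))
  .map(p => Row(p(0).trim, p(1).trim, p(2).trim.toInt))
```

With a SparkContext available, sqlContext.createDataFrame(sc.parallelize(typedRows), typedSchema) should then give a DataFrame whose age column is an integer.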

Answer 3

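Reading the file with sqlContext's sqlContext.read.csv() method works well, since it offers many built-in options that you can pass to control the execution. However, it may not be available if you are working on a Spark version earlier than 1.6. In that case you can also do it with the SparkContext's textFile method.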

val a = sc.textFile("file:///file-path/fileName")

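This gives you an RDD[String]. So you now have an RDD and want to convert it into a DataFrame.

Now go ahead and define a schema for the RDD using StructType. This lets you have as many StructFields as you need.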

val schema = StructType(Array(StructField("fieldName1", fieldType, ifNullable),
                              StructField("fieldName2", fieldType, ifNullable),
                              StructField("fieldName3", fieldType, ifNullable),
                              ................
                              ))
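
With more than 100 fields, as in the question, the StructField list can be generated instead of typed out by hand. The following is only a sketch: the names fieldNames and wideSchema, the count of 120 and the choice of StringType for every column are assumptions for illustration.

```scala
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical column names; in practice they might come from a header line
// or a properties file.
val fieldNames = (1 to 120).map(i => s"field$i")

// One nullable string column per name; there is no 22-field limit on a StructType.
val wideSchema = StructType(
  fieldNames.map(name => StructField(name, StringType, nullable = true)))
```

Columns that need another type can be given their own StructField afterwards, for example by mapping over a sequence of (name, type) pairs instead of plain names.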

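You now have two things: 1) the RDD, which we created with the textFile method; 2) the schema, with the required number of attributes.

The next step is, of course, to map your RDD onto this schema. Notice that each element of your RDD is a single String, i.e. it is an RDD[String], whereas what you actually want is to break each string into the many fields for which you created the schema. So why not split the RDD on commas? The following expression does this with a map operation.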

val b = a.map(x => x.split(","))

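This evaluates to an RDD[Array[String]].

But you might say that this Array[String] is still not very convenient to operate on. This is where the Row API comes to your rescue. Import it with import org.apache.spark.sql.Row; we will then map the split RDD to Row objects, which work much like tuples. See this: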

import org.apache.spark.sql.Row
val c = b.map(x => Row(x(0), x(1),....x(n)))

The expression above gives you an RDD in which every element is a Row. All it needs now is a schema. Once again, sqlContext's createDataFrame method does the job for you.

val myDataFrame = sqlContext.createDataFrame(c, schema)

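This method takes two arguments: 1) the RDD you want to work with; 2) the schema you want to apply on top of it. The result is a DataFrame object, so we have now created the DataFrame myDataFrame. If you call the show method on myDataFrame, you can see the data in tabular form. You can now run any Spark SQL operation on it.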
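
A note on the Row(x(0), x(1), ...) step: with 100+ fields, listing every index by hand is error-prone. A possible shortcut, sketched here on a single hypothetical line rather than a full RDD, is Row.fromSeq, which builds a Row from a whole sequence of values. The names sampleLine and row are illustrative only.

```scala
import org.apache.spark.sql.Row

// Hypothetical input line standing in for one element of the RDD[String].
val sampleLine = "a,b,c"

// split with limit -1 keeps trailing empty fields, so the Row always has
// one value per delimiter-separated column.
val row = Row.fromSeq(sampleLine.split(",", -1).toSeq)
```

On the split RDD from the answer, the same idea would read b.map(x => Row.fromSeq(x.toSeq)); every value stays a String, so the schema would need StringType columns unless the values are converted first.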
