
How to create a DataFrame without using a case class?

I want to create a DataFrame from a text file.

Case classes have a limit of 22 fields (in Scala 2.10); I have more than 100 fields.

Hence I am facing an issue while creating the case class.

My actual goal is to create a DataFrame.

Is there any other way to create a DataFrame without using a case class?

One way is to use the spark-csv package to read the files directly and create a DataFrame. The package will infer the schema directly from the header if your file has one, or you can create a custom schema using StructType.

In the example below, I have created a custom schema.

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types._

val sqlContext = new SQLContext(sc)
val customSchema = StructType(Array(
    StructField("year", IntegerType, true),
    StructField("make", StringType, true),
    StructField("model", StringType, true),
    StructField("comment", StringType, true),
    StructField("blank", StringType, true)))

val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") // Use first line of all files as header
    .schema(customSchema)
    .load("cars.csv")

Alternatively, you can let the package infer the schema automatically:

val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") // Use first line of all files as header
    .option("inferSchema", "true") // Automatically infer data types
    .load("cars.csv")

You can check the various other options on the databricks spark-csv documentation page.

Another option:

You can create a schema using StructType as shown above and then use the createDataFrame method of sqlContext to create the DataFrame.

import org.apache.spark.sql.Row
val vRdd = sc.textFile(..filelocation..)
val rowRdd = vRdd.map(_.split(",")).map(fields => Row.fromSeq(fields)) // convert each line to a Row before applying the schema
val df = sqlContext.createDataFrame(rowRdd, schema)

From the Spark Documentation:

When case classes cannot be defined ahead of time (for example, the structure of records is encoded in a string, or a text dataset will be parsed and fields will be projected differently for different users), a DataFrame can be created programmatically with three steps.

  1. Create an RDD of Rows from the original RDD;
  2. Create the schema represented by a StructType matching the structure of Rows in the RDD created in Step 1.
  3. Apply the schema to the RDD of Rows via createDataFrame method provided by SQLContext .
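
As a minimal sketch of those three steps (the file name people.txt and the column names name and city are hypothetical, and all columns are kept as strings):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{ StructType, StructField, StringType }

// 1. Create an RDD of Rows from the original RDD of lines.
val rowRdd = sc.textFile("people.txt")
  .map(_.split(","))
  .map(fields => Row(fields(0).trim, fields(1).trim))

// 2. Create the schema as a StructType matching the structure of the Rows.
val schema = StructType(Array(
  StructField("name", StringType, true),
  StructField("city", StringType, true)))

// 3. Apply the schema to the RDD of Rows via createDataFrame.
val peopleDf = sqlContext.createDataFrame(rowRdd, schema)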

Another way is to define each StructField with its data type inside a StructType. This allows you to define multiple data types. Please see the example below for both implementations; the commented-out code shows the first implementation.

package com.spark.examples

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
// Import Row.
import org.apache.spark.sql.Row
// Import Spark SQL data types
import org.apache.spark.sql.types.{ StructType, StructField, StringType }

object MultipleDataTypeSchema extends Serializable {

  val conf = new SparkConf().setAppName("schema definition")

  conf.set("spark.executor.memory", "100M")
  conf.setMaster("local")

  val sc = new SparkContext(conf);
  // sc is an existing SparkContext.
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  def main(args: Array[String]): Unit = {

    // Create an RDD
    val people = sc.textFile("C:/Users/User1/Documents/test")

    /* First implementation: the schema is encoded in a string; split the schema, then map it.
     * All columns will be of string type.

    // Generate the schema based on the schema string
    val schemaString = "name address age" // Here you can read the column names from a properties file too.
    val schema =
      StructType(
        schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)));*/

    // Second implementation: define multiple data types

    val schema =
      StructType(
        StructField("name", StringType, true) ::
          StructField("address", StringType, true) ::
          StructField("age", StringType, false) :: Nil)

    // Convert records of the RDD (people) to Rows.
    val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim, p(2).trim))
    // Apply the schema to the RDD.
    val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)
    peopleDataFrame.printSchema()

    sc.stop

  }
}

Its Output:

17/01/03 14:24:13 INFO SparkContext: Created broadcast 0 from textFile at MultipleDataTypeSchema.scala:30
root
 |-- name: string (nullable = true)
 |-- address: string (nullable = true)
 |-- age: string (nullable = false)

Reading a file through sqlContext's read.csv() method works well, as it has many built-in options that let you pass parameters and control the execution. However, the built-in csv reader is only available from Spark 2.0 onwards; earlier versions need the spark-csv package shown above.
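
A minimal sketch of that route, assuming Spark 2.0+ (the path below is just a placeholder):

// Built-in csv reader (Spark 2.0+); older versions need the spark-csv package instead.
val df = sqlContext.read
    .option("header", "true")      // use the first line as column names
    .option("inferSchema", "true") // let Spark infer the column types
    .csv("file:///file-path/fileName")

On older Spark versions, or if you want full control over parsing, you may also do it with SparkContext's textFile method: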

val a = sc.textFile("file:///file-path/fileName")

This gives you an RDD[String]. So you have created the RDD, and now you want to convert it to a DataFrame.

Now go ahead and define the schema for your RDD using StructType. This allows you to have as many StructFields as you need.

val schema = StructType(Array(StructField("fieldName1", fieldType, ifNullable),
                              StructField("fieldName2", fieldType, ifNullable),
                              StructField("fieldName3", fieldType, ifNullable),
                              ................
                              ))
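
For example, a concrete schema with three hypothetical string columns could look like this (if you use non-string types such as IntegerType, the corresponding Row values must be converted, e.g. with .toInt):

import org.apache.spark.sql.types.{ StructType, StructField, StringType }

// Hypothetical three-column schema; all fields kept as nullable strings.
val schema = StructType(Array(StructField("id", StringType, true),
                              StructField("name", StringType, true),
                              StructField("age", StringType, true)))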

You now have two things: 1) the RDD, which we created using the textFile method, and 2) the schema, with the required number of attributes.

The next step is to map this schema onto your RDD. You may observe that the RDD you have holds a single String per record, i.e. RDD[String], but what you actually want is to convert each record into the separate fields for which you created the schema. So split your RDD on the comma. The following map operation does this:

val b = a.map(x => x.split(","))

You get an RDD[Array[String]] on evaluation.

But you may say that this Array[String] is still not something you can apply the schema to directly. This is where the Row API comes to the rescue. Import it using import org.apache.spark.sql.Row, and map your split RDD into Row objects. See this:

import org.apache.spark.sql.Row
val c = b.map(x => Row(x(0), x(1),....x(n)))
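
With 100+ fields, enumerating every index gets tedious; as a sketch, Row.fromSeq builds the Row from the whole array in one go (this assumes every column stays a string, matching an all-StringType schema):

import org.apache.spark.sql.Row
// Build one Row per split line without listing each index.
val c = b.map(fields => Row.fromSeq(fields))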

Either way, you get an RDD where every element is a Row. You just need to give it a schema now. Again, sqlContext's createDataFrame method does the job for you quite simply.

val myDataFrame = sqlContext.createDataFrame(c, schema)

This method takes two parameters: 1) the RDD you need to work on, and 2) the schema you want to apply on top of it. The result of the evaluation is a DataFrame object. So finally we have our DataFrame object myDataFrame. If you call the show method on myDataFrame, you get to see the data in tabular format, and you are now good to perform any Spark SQL operation on it.
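
For instance (the table name people is a placeholder; registerTempTable is the Spark 1.x API, replaced by createOrReplaceTempView in 2.x):

// Inspect the data and run a Spark SQL query against it.
myDataFrame.show()
myDataFrame.registerTempTable("people")
val result = sqlContext.sql("SELECT * FROM people")
result.show()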
