Spark DataFrame to be written in partitions

I am new to Spark and Scala. I want to read a directory containing JSON files. The files have an attribute called "EVENT_NAME" which can have 20 different values. I need to separate the events depending on the attribute value, i.e. group all events with EVENT_NAME=event_A together, and write them into a Hive external table structure like: /apps/hive/warehouse/db/event_A/dt=date/hour=hr

Here I have 20 different tables, one per event type, and the data related to each event should go to its respective table. I have managed to write some code but need help writing my data out correctly.

import org.apache.spark.sql._
import sqlContext._

val path = "/source/data/path"
val trafficRep = sc.textFile(path)

val trafficRepDf = sqlContext.read.json(trafficRep)
trafficRepDf.registerTempTable("trafficRepDf")

trafficRepDf.write.partitionBy("EVENT_NAME").save("/apps/hive/warehouse/db/sample")

The last line creates partitioned output, but it is not exactly what I need. Please suggest how I can get it right, or any other piece of code to do it.

I'm assuming you mean you'd like to save the data into separate directories, without using Spark/Hive's {column}={value} format.

You won't be able to use Spark's partitionBy, as Spark partitioning forces you to use that format.
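For reference, partitionBy produces a Hive-style layout like the following (the exact part-file names vary by Spark version):

/apps/hive/warehouse/db/sample/EVENT_NAME=event_A/part-00000
/apps/hive/warehouse/db/sample/EVENT_NAME=event_B/part-00000
...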

Instead, you have to break your DataFrame into its component partitions, and save them one by one, like so:

import org.apache.spark.sql._
import sqlContext.implicits._ // for the $"..." column syntax

val path = "/source/data/path"
val trafficRep = sc.textFile(path)

val trafficRepDf = sqlContext.read.json(trafficRep)

// Collect the distinct event names as plain strings.
// Or if you already know what all 20 values are, just hardcode them.
val eventNames = trafficRepDf.select($"EVENT_NAME").distinct().collect().map(_.getString(0))

for (eventName <- eventNames) {
  // Filter one event's rows and save them into their own directory.
  val trafficRepByEventDf = trafficRepDf.where($"EVENT_NAME" === eventName)
  trafficRepByEventDf.write.save(s"/apps/hive/warehouse/db/sample/${eventName}")
}
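To get the exact layout from the question (/apps/hive/warehouse/db/event_A/dt=date/hour=hr), you could combine this per-event split with partitionBy on the remaining columns. A sketch, assuming the DataFrame already carries dt and hour columns (the next answer shows one way to add them):

// Sketch: one plain directory per event, with Hive-style dt=/hour=
// partitions inside it. Assumes dt and hour columns exist on trafficRepDf.
for (eventName <- eventNames) {
  trafficRepDf
    .where($"EVENT_NAME" === eventName)
    .drop("EVENT_NAME") // the event is already encoded in the directory name
    .write
    .partitionBy("dt", "hour")
    .save(s"/apps/hive/warehouse/db/${eventName}")
}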

You can add date and hour columns to your DataFrame.

import org.apache.spark.sql._
import org.apache.spark.sql.functions.lit // for literal column values

val path = "/source/data/path"
val trafficRep = sc.textFile(path)

val trafficRepDf = sqlContext.read.json(trafficRep)

// withColumn returns a new DataFrame; the result must be assigned,
// otherwise the added columns are lost.
val trafficRepWithPartitionsDf = trafficRepDf
  .withColumn("dt", lit("dtValue"))
  .withColumn("hour", lit("hourValue"))

trafficRepWithPartitionsDf.write.partitionBy("EVENT_NAME", "dt", "hour").save("/apps/hive/warehouse/db/sample")
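In practice the partition values would usually come from the data rather than hardcoded literals. A sketch, assuming a hypothetical event_time timestamp column in the JSON (the column name is an assumption, not from the question):

import org.apache.spark.sql.functions.{date_format, hour}
import sqlContext.implicits._

// Hypothetical: derive dt and hour from an event_time column.
val withDerivedPartitions = trafficRepDf
  .withColumn("dt", date_format($"event_time", "yyyy-MM-dd"))
  .withColumn("hour", hour($"event_time"))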

I am assuming you want a table structure like /apps/hive/warehouse/db/EVENT_NAME=xx/dt=yy/hour=zz, in which case you need to partition by EVENT_NAME, dt and hour, so try the following:

trafficRepDf.write.partitionBy("EVENT_NAME","dt","hour").save("/apps/hive/warehouse/db/sample")
