Spark DataFrame to be written in partitions

I am new to Spark and Scala. I want to read a directory containing JSON files. The files have an attribute called "EVENT_NAME" which can have 20 different values. I need to separate the events depending on the attribute value, i.e. group all events with EVENT_NAME=event_A together, and write them into a Hive external table structure like: /apps/hive/warehouse/db/event_A/dt=date/hour=hr

Here I have 20 different tables, one per event type, and the data related to each event should go to its respective table. I have managed to write some code but need help writing my data out correctly.

import org.apache.spark.sql._
import sqlContext._

val path = "/source/data/path"
val trafficRep = sc.textFile(path)

val trafficRepDf = sqlContext.read.json(trafficRep)
trafficRepDf.registerTempTable("trafficRepDf")

trafficRepDf.write.partitionBy("EVENT_NAME").save("/apps/hive/warehouse/db/sample")

The last line creates partitioned output, but it is not exactly what I need. Please suggest how I can get it right, or any other piece of code to do it.

I'm assuming you mean you'd like to save the data into separate directories, without using Spark/Hive's {column}={value} format.

You won't be able to use Spark's partitionBy, as Spark partitioning forces you to use that format.
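For reference, partitionBy produces a Hive-style layout like the following (the exact part-file names vary by Spark version):

/apps/hive/warehouse/db/sample/EVENT_NAME=event_A/part-00000
/apps/hive/warehouse/db/sample/EVENT_NAME=event_B/part-00000
...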

Instead, you have to break your DataFrame into its component partitions, and save them one by one, like so:

import org.apache.spark.sql._
import sqlContext.implicits._ // for the $"..." column syntax

val path = "/source/data/path"
val trafficRep = sc.textFile(path)

val trafficRepDf = sqlContext.read.json(trafficRep)

// Collect the distinct event names as plain strings.
// Or if you already know what all 20 values are, just hardcode them.
val eventNames = trafficRepDf.select($"EVENT_NAME").distinct().collect().map(_.getString(0))

for (eventName <- eventNames) {
  // Filter one event's rows and save them into their own directory.
  val trafficRepByEventDf = trafficRepDf.where($"EVENT_NAME" === eventName)
  trafficRepByEventDf.write.save(s"/apps/hive/warehouse/db/sample/${eventName}")
}
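To get the exact layout from the question (/apps/hive/warehouse/db/event_A/dt=date/hour=hr), you could combine this per-event split with partitionBy on the remaining columns. A sketch, assuming the DataFrame already carries dt and hour columns (the next answer shows one way to add them):

// Sketch: one plain directory per event, with Hive-style dt=/hour=
// partitions inside it. Assumes dt and hour columns exist on trafficRepDf.
for (eventName <- eventNames) {
  trafficRepDf
    .where($"EVENT_NAME" === eventName)
    .drop("EVENT_NAME") // the event is already encoded in the directory name
    .write
    .partitionBy("dt", "hour")
    .save(s"/apps/hive/warehouse/db/${eventName}")
}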

You can add date and hour columns to your DataFrame.

import org.apache.spark.sql._
import org.apache.spark.sql.functions.lit // for literal column values

val path = "/source/data/path"
val trafficRep = sc.textFile(path)

val trafficRepDf = sqlContext.read.json(trafficRep)

// withColumn returns a new DataFrame; the result must be assigned,
// otherwise the added columns are lost.
val trafficRepWithPartitionsDf = trafficRepDf
  .withColumn("dt", lit("dtValue"))
  .withColumn("hour", lit("hourValue"))

trafficRepWithPartitionsDf.write.partitionBy("EVENT_NAME", "dt", "hour").save("/apps/hive/warehouse/db/sample")
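In practice the partition values would usually come from the data rather than hardcoded literals. A sketch, assuming a hypothetical event_time timestamp column in the JSON (the column name is an assumption, not from the question):

import org.apache.spark.sql.functions.{date_format, hour}
import sqlContext.implicits._

// Hypothetical: derive dt and hour from an event_time column.
val withDerivedPartitions = trafficRepDf
  .withColumn("dt", date_format($"event_time", "yyyy-MM-dd"))
  .withColumn("hour", hour($"event_time"))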

I am assuming you want a table structure like /apps/hive/warehouse/db/EVENT_NAME=xx/dt=yy/hour=zz, in which case you need to partition by EVENT_NAME, dt and hour, so try the following:

trafficRepDf.write.partitionBy("EVENT_NAME","dt","hour").save("/apps/hive/warehouse/db/sample")
