
spark streaming save base64 rdd to json on s3

The Scala application below cannot save an RDD in JSON format to S3.

I have:

  1. A Kinesis stream onto which complex objects are placed. Each object has `JSON.stringify()` applied to it before being put on the stream via the Kinesis PutRecord method.
  2. A Scala Spark Streaming job that reads these items off the stream.
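For context on what the consumer side receives: Kinesis base64-encodes the `Data` blob on the wire, but the SDK (and therefore the Spark receiver) hands it back as already-decoded raw bytes, so recovering the original string is just a UTF-8 decode. A plain-Scala sketch of that round trip, with a hypothetical payload:

```scala
import java.nio.charset.StandardCharsets
import java.util.Base64

object RoundTrip {
  // Hypothetical payload, as JSON.stringify() would produce it on the producer side
  val json = """{"id":1,"name":"example"}"""

  // PutRecord base64-encodes the Data blob for transport...
  val onTheWire: String =
    Base64.getEncoder.encodeToString(json.getBytes(StandardCharsets.UTF_8))

  // ...and the consumer-side SDK decodes it back to raw bytes before
  // the Spark receiver ever sees the record...
  val received: Array[Byte] = Base64.getDecoder.decode(onTheWire)

  // ...so recovering the original JSON string is just a UTF-8 decode
  val recovered: String = new String(received, StandardCharsets.UTF_8)
}
```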

I cannot seem to save the RDD records that come off the stream as JSON files in an S3 bucket.

In the code I've attempted to convert the `RDD[Array[Byte]]` to `RDD[String]` and then load it with `spark.read.json`, but no luck. I've tried various other combinations and can't seem to output the data onto S3 in its raw format.

import org.apache.spark._
import org.apache.spark.sql._
import java.util.Base64
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Milliseconds, StreamingContext}
import org.apache.spark.streaming.Duration
import org.apache.spark.streaming.kinesis._
import org.apache.spark.streaming.kinesis.KinesisInputDStream

import org.apache.spark.streaming.kinesis.KinesisInitialPositions.Latest

object ScalaStream {
  def main(args: Array[String]): Unit = {  
        val appName = "ScalaStreamExample"
        val batchInterval = Milliseconds(2000)
        val outPath = "s3://xxx-xx--xxx/xxxx/"

        val spark = SparkSession
            .builder()
            .appName(appName)
            .getOrCreate()

        val sparkContext = spark.sparkContext
        val streamingContext = new StreamingContext(sparkContext, batchInterval)

        // Populate the appropriate variables from the given args
        val checkpointAppName = "xxx-xx-xx--xx"
        val streamName = "cc-cc-c--c--cc"
        val endpointUrl = "https://kinesis.xxx-xx-xx.amazonaws.com"
        val regionName = "cc-xxxx-xxx"
        val initialPosition = new Latest()
        val checkpointInterval = batchInterval
        val storageLevel = StorageLevel.MEMORY_AND_DISK_2

        val kinesisStream = KinesisInputDStream.builder
         .streamingContext(streamingContext)
         .endpointUrl(endpointUrl)
         .regionName(regionName)
         .streamName(streamName)
         .initialPosition(initialPosition)
         .checkpointAppName(checkpointAppName)
         .checkpointInterval(checkpointInterval)
         .storageLevel(StorageLevel.MEMORY_AND_DISK_2)
         .build()

        kinesisStream.foreachRDD { rdd =>
            if (!rdd.isEmpty()){
                //**************** .  <---------------
                // This is where I'm trying to save the raw JSON objects to S3 as JSON files
                // tried various combinations here but no luck.
                val dataFrame = spark.read.json(rdd.map(record => new String(record))) // bytes -> String -> DataFrame
                dataFrame.write.mode(SaveMode.Append).json(outPath + "/" + rdd.id.toString)
                //**************** <----------------
            }
        }

        // Start the streaming context and await termination
        streamingContext.start()
        streamingContext.awaitTermination()
    }
}

Any ideas what I'm missing?
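Since each record is already a complete JSON string, one simpler way to land the raw payloads on S3 (avoiding the `spark.read.json` round trip entirely) is to decode the bytes and write the strings verbatim with `saveAsTextFile`. A sketch of the per-record transform, with the streaming wiring shown in comments (names like `outPath` taken from the code above):

```scala
import java.nio.charset.StandardCharsets

object RawJsonOut {
  // Per-record transform used inside foreachRDD: each Kinesis record
  // arrives as Array[Byte] containing a UTF-8 JSON string
  def decode(record: Array[Byte]): String =
    new String(record, StandardCharsets.UTF_8)

  // In the streaming job this would plug in as (sketch, untested here):
  //   kinesisStream.foreachRDD { rdd =>
  //     if (!rdd.isEmpty()) {
  //       rdd.map(decode).saveAsTextFile(outPath + "/" + rdd.id)
  //     }
  //   }
  // saveAsTextFile writes each string as one line, so the files on S3
  // hold the JSON exactly as it was put on the stream.
}
```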

The reason it failed to work turned out to be a complete red herring: it was a Scala version conflict with what is available on EMR.

Many similar questions on SO suggested this might be the issue, but whilst the Spark documentation lists Scala 2.12.4 as compatible with Spark 2.4.4, the EMR instance does not appear to support Scala 2.12.4. So I've updated my build.sbt and deploy script from:

build.sbt:

name := "Simple Project"

version := "1.0"

scalaVersion := "2.12.8"

libraryDependencies += "org.apache.spark" % "spark-sql_2.12" % "2.4.4"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.12" % "2.4.4"
libraryDependencies += "org.apache.spark" % "spark-streaming-kinesis-asl_2.12" % "2.4.4"

to:

name := "Simple Project"

version := "1.0"

scalaVersion := "2.11.12"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.4"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.4.4" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "2.4.4"
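A side benefit of the corrected build.sbt is the `%%` operator, which makes sbt append the project's Scala binary version to the artifact name automatically, so the dependencies can no longer drift out of sync with `scalaVersion` the way hard-coded `_2.12` suffixes could. For example, with `scalaVersion := "2.11.12"`, these two lines resolve the same artifact:

```scala
// equivalent under scalaVersion := "2.11.12"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.4"
libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "2.4.4"
```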

deploy.sh:

aws emr add-steps --cluster-id j-xxxxx --steps Type=spark,Name=ScalaStream,Args=[\
--class,"ScalaStream",\
--deploy-mode,cluster,\
--master,yarn,\
--packages,\'org.apache.spark:spark-streaming-kinesis-asl_2.11:2.4.4\',\
--conf,spark.yarn.submit.waitAppCompletion=false,\
--conf,yarn.log-aggregation-enable=true,\
--conf,spark.dynamicAllocation.enabled=true,\
--conf,spark.cores.max=4,\
--conf,spark.network.timeout=300,\
s3://ccc.xxxx/simple-project_2.11-1.0.jar\
],ActionOnFailure=CONTINUE
