
Kafka Streaming Using Scala and Spark Error: Exception thrown in awaitResult

I'm still new to Kafka, and I'm currently trying to generate and read a .csv file and then stream it from IntelliJ IDEA (running on Windows) through Kafka to a Kafka consumer (ZooKeeper, the broker, and the consumer are running in WSL), but I keep failing to do so.

Here is my build.sbt:

name := "kafka_test"

version := "0.1"

scalaVersion := "2.13.6"

//libraryDependencies += "org.apache.spark" % "spark-streaming_2.11" % "2.2.0"
//libraryDependencies += "org.apache.spark" % "spark-streaming-kafka-0-8_2.11" % "2.1.0"
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.2.0"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "3.2.0"
//libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka" % "1.6.0"
//libraryDependencies += "org.apache.spark" % "spark-streaming-kafka-0-10_2.11" % "2.2.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.2.0"
libraryDependencies += "org.apache.spark" %% "spark-hive" % "3.2.0"
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "3.2.0"

Here is my code:

import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
//import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming._
//import org.apache.spark.streaming.kafka
import org.apache.spark.streaming.StreamingContext._

import java.beans.Statement
import java.sql.{Connection, DriverManager, SQLException}
import scala.io.StdIn
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Encoder, Encoders, SparkSession, functions}

import scala.util.control.Breaks._
import org.apache.spark.storage.StorageLevel
import org.apache.spark.sql.functions.{col, count, countDistinct, desc, from_json, when}
import org.apache.spark.sql.types.StructType

import java.io._

import CustomImplicits._


object kafkar {

  def main(args: Array[String]): Unit = {

    Logger.getLogger("org").setLevel(Level.OFF)
    Logger.getLogger("akka").setLevel(Level.OFF)

    println("Program Started")

    val conf = new SparkConf().setMaster("local[4]").setAppName("kafkar")
    val ssc = new StreamingContext(conf, Seconds(2))

    //INITIATE SPARK SESSION//
    System.setProperty("hadoop.home.dir", "C:\\hadoop")
    val spark = SparkSession
      .builder
      .appName("Kafka Streaming")
      .config("spark.master", "local")
      .enableHiveSupport()
      .getOrCreate()
    println("Created Spark Session")
    spark.sparkContext.setLogLevel("ERROR")


    //my kafka topic name is 'mytest'
//    val kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181","spark-streaming-consumer-group", Map("mytest" -> 5) )
//    kafkaStream.print()
//    ssc.start
//    ssc.awaitTermination()


    //GENERATES A .CSV FILE WITH REQUIRED SCHEMA
    val customer_names = List("SpaceX", "Blue Origin", "Orbital Sciences Corporation", "Boeing", "Northrop Grumman Innovation Systems", "Sierra Nevada Corporation", "Scaled Composites", "The Spaceship Company", "NASA", "Lockheed Martin", "ESA", "JAXA", "Rocket Lab", "Virgin Galactic", "Copenhagen Suborbitals", "ROSCOSMOS", "CNSA")
    val customer_countries = List("United States", "United States", "United States", "United States", "United States", "United States", "United States", "United States", "United States", "United States", "France", "Japan", "New Zealand", "England", "Denmark", "Russia", "China")
    val customer_cities = List("Hawthorne CA", "Kent WA", "Dulles VA", "Chicago IL", "Dulles VA", "Sparks NV", "Mojave CA", "Mojave CA", "Washington DC", "Bethesda MD", "Paris", "Tokyo", "Auckland", "London", "Copenhagen", "Moscow", "Beijing")

    val product_names = List("Dragon Capsule", "Falcon 9 Rocket", "Dream Chaser Cargo System", "Biconic Farrier", "Second-stage Fuselage", "Life Support Systems", "Reaction Wheels", "Air Jordans", "Geosynchronous Satellite", "Docking Ports (x3)", "Space Junk")
    val product_categories = List("Rocket", "Rocket", "System", "Part", "Part", "System", "Part", "Misc.", "Satellite", "Part", "Misc.")
    val product_prices = List("$100,000", "$10,000,000", "$1,000,000", "$1000", "$10,000", "$10,000", "$100", "Priceless", "$1,000,000", "$1000", "$0")

    val payment_types = List("Mastercard", "Discover", "Capital One", "Zelle Transfer", "UPI", "Google Wallet", "Apple Pay")

    val failure_reasons = List("Invalid CVV", "Not Enough Balance", "Incorrect Payment Address", "Suspicious Purchase Activity", "They're totally using this to make a bomb...")

    val r = scala.util.Random
    var now = java.time.Instant.now
    //val file = scala.tools.nsc.io.File("transactions.csv")
    val file = new File("input/transactions.csv" )
    val printWriter = new PrintWriter(file)
    file.delete()

    for(i <- 1 to 2000){
      val rand_customer = r.nextInt(customer_names.length)
      val rand_product = r.nextInt(product_names.length)
      val rand_payment = payment_types(r.nextInt(payment_types.length))
      val rand_quantity = (-Math.log(r.nextDouble())*10).toInt + 1
      val rand_txn_id = (r.alphanumeric take 10).mkString
      val rand_success = if (r.nextInt(100) == 0) "N" else "Y"
      val rand_reason = if(rand_success == "Y") " " else failure_reasons(r.nextInt(failure_reasons.length))
      val rand_time_pass = r.nextInt(50000)
      now = now.plusSeconds(rand_time_pass)

      //order_id, customer_id, customer_name, product_id, product_name, product_category, payment_type, qty, price, datetime, country, city, ecommerce_website_name, payment_txn_id, payment_txn_success, failure_reason
      val transaction = List(i, 101 + rand_customer, customer_names(rand_customer), 10001 + rand_product, product_names(rand_product), product_categories(rand_product), rand_payment, rand_quantity, product_prices(rand_product), now, customer_countries(rand_customer), customer_cities(rand_customer), "AllTheSpaceYouNeed.com", rand_txn_id, rand_success, rand_reason).mkString(",")

      //println(transaction)
      //file.appendAll(transaction + "\n")
      printWriter.write(transaction + "\n")
    }


    //    val df = spark.readStream
    //      .format("kafka")
    //      .option("kafka.bootstrap.servers", "localhost:9092")
    //      .option("subscribe", "kafka_test_topic")
    //      .option("startingOffsets", "earliest") // From starting
    //      .load()
    //
    //    df.printSchema()
    //
    //    val personStringDF = df.selectExpr("CAST(value AS STRING)")

    val userSchema = new StructType().add("order_id", "integer").add("customer_id", "integer").add("customer_name", "string")
      .add("product_id", "integer").add("product_name", "string").add("product_category", "string").add("qty", "integer")
      .add("price", "integer").add("datetime", "string").add("country", "string").add("city", "string")
      .add("ecommerce_website_name", "string").add("payment_txn_id", "string")
      .add("payment_txn_success", "string").add("failure_reason", "string")



//    val df = spark.readStream
//      .format("rate")
//      .option("rowsPerSecond", 10)
//      .load()


    //    df.writeStream
    //      .option("checkpointLocation", "/input/")
    //      .toTable("myTable")
    //
    //    // Check the table result
    //    spark.read.table("myTable").show()
    // Write the streaming DataFrame to a table
    /* df.writeStream
       .option("checkpointLocation", "path/to/checkpoint/dir")
       .toTable("myTable")

     spark.read.table("myTable").show()*/
    /*      val df = spark.read.csv("transactions.csv")
          df.show()*/

    //    val csvDF = personStringDF.select(from_json(col("value"), userSchema).as("data"))
    //      .select("data.*")

    //READ THE .CSV FILE
    val csvDF = spark
      .readStream
      .option("sep", ",")
//      .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
//      .option("subscribe", "topic1")
      .schema(userSchema)      // Specify schema of the csv files
      .format("csv")
      .load("input\\")    // Equivalent to format("csv").load("/path/to/directory")


//    csvDF.writeStream
//      .format("console")
//      .outputMode("append")
//      .start()
//      .awaitTermination()

    //WRITE IT TO A KAFKA TOPIC
    csvDF
      .writeStream // use `write` for batch, like DataFrame
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("topic", "target_topic")
      .option("checkpointLocation", "tmp/vaquarkhan/checkpoint")
      .start()

  }

}

and here is the error I'm getting:

Exception in thread "stream execution thread for [id = c9d29df1-5f05-44b3-958e-a99e1a87beb9, runId = e0c3b3e5-05ff-4aee-be56-38ae8016bd40]" org.apache.spark.SparkException: Exception thrown in awaitResult: 
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
    at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:103)
    at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:87)
    at org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:119)
    at org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:402)
    at org.apache.spark.sql.execution.streaming.StreamExecution.$anonfun$runStream$3(StreamExecution.scala:352)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
    at org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
    at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:333)
    at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:209)
Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already stopped.
    at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:176)
    at org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:144)
    at org.apache.spark.rpc.netty.NettyRpcEnv.askAbortable(NettyRpcEnv.scala:242)
    at org.apache.spark.rpc.netty.NettyRpcEndpointRef.askAbortable(NettyRpcEnv.scala:555)
    at org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:559)
    at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:102)
    ... 8 more

Process finished with exit code 0

I have tried multiple approaches, and the farthest I've gotten is writing the output to the console inside IntelliJ IDEA. My other attempts have either resulted in an error, or the code runs successfully but the consumer does not receive the expected result.

I was able to solve the errors in my code using this video:

https://youtu.be/OPTMje7wKmU

All the credit goes to its creator.

Problem 1: My IntelliJ IDEA was not communicating properly with Kafka running in WSL, as pointed out by @OneCricketeer. I solved this by running Kafka under Windows instead of WSL.
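For reference, a quick way to check whether the IDE side can reach the broker at all is a small connectivity test with the Kafka AdminClient. This is just a sketch, not what I actually ran; it assumes kafka-clients is on the classpath (it comes in transitively with spark-sql-kafka-0-10) and that the broker advertises itself on localhost:9092:

import java.util.Properties
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig}

object BrokerCheck {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    // Adjust the address to wherever the broker is actually listening
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
    props.put(AdminClientConfig.REQUEST_TIMEOUT_MS_CONFIG, "5000")

    val admin = AdminClient.create(props)
    try {
      // Times out with an exception if the broker cannot be reached
      val topics = admin.listTopics().names().get()
      println(s"Connected. Topics: $topics")
    } finally {
      admin.close()
    }
  }
}

If this fails from Windows while the broker runs in WSL, the problem is the networking between the two, not the Spark code.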

Problem 2: My code was full of errors. I believe I was taking an incorrect approach towards what I was trying to achieve due to my lack of knowledge in the subject. Thanks to the video above I was able to edit my code to do what I originally intended, which was to send messages from the IDE to be consumed by a Kafka consumer client; a sketch of the kind of changes involved is below.
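For anyone hitting the same stack trace: start() returns immediately, so when main falls through, the SparkContext is shut down while the stream execution thread is still running, which is what the "RpcEnv already stopped" cause suggests. In addition, Spark's Kafka sink expects the output rows to have a string or binary "value" column (optionally "key" and "topic"). A minimal sketch of those two changes, assuming the same csvDF as in the question, a broker on localhost:9092, and that the topic target_topic already exists:

    printWriter.close()   // flush the generated CSV before the file source reads the directory

    import org.apache.spark.sql.functions.{col, struct, to_json}

    // The Kafka sink requires a "value" column, so serialize each row to a JSON string
    val query = csvDF
      .select(to_json(struct(csvDF.columns.map(col): _*)).alias("value"))
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("topic", "target_topic")
      .option("checkpointLocation", "tmp/kafka-checkpoint")  // any writable path
      .start()

    // Keep the driver alive until the query is stopped or fails
    query.awaitTermination()

With awaitTermination() in place the driver no longer exits with code 0 while the streaming query is still starting up.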
