
Spark streaming is not working in Standalone cluster deployed in VM

I have written a Kafka streaming program using Scala and I am executing it on a Spark standalone cluster. The code works fine on my local machine. I have set up Kafka, Cassandra and Spark on an Azure VM, and I have opened all inbound and outbound ports to avoid port blocking.

Started the master:

sbin> ./start-master.sh

Started the slave:

sbin# ./start-slave.sh spark://vm-hostname:7077

I have verified this status in the master web UI.

Submitted the job:

bin# ./spark-submit --class xyStreamJob --master spark://vm-hostname:7077 /home/user/appl.jar

I noticed that the application was added and displayed in the master web UI.

I have published a few messages to the topic, but they are not received or persisted to the Cassandra DB.

I clicked the application name on the master web console and noticed that the Streaming tab is not available on that application's console page.

Why is the application not working in the VM when it works fine locally?

How can I debug the issue in the VM?

// Imports used by this snippet (SparkHelper and CassandraHelper are the
// application's own helper classes and are not shown here).
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges, KafkaUtils}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

def main(args: Array[String]): Unit = {
    val spark = SparkHelper.getOrCreateSparkSession()
    val ssc = new StreamingContext(spark.sparkContext, Seconds(1))
    spark.sparkContext.setLogLevel("WARN")

    // One direct stream per partition of the input topic; kafkaStreams is a Seq of DStreams.
    val kafkaStreams = {
      val kafkaParams = Map[String, Object](
        "bootstrap.servers" -> "vmip:9092",
        "key.deserializer" -> classOf[StringDeserializer],
        "value.deserializer" -> classOf[StringDeserializer],
        "group.id" -> "loc",
        "auto.offset.reset" -> "latest",
        "enable.auto.commit" -> (false: java.lang.Boolean)
      )

      val topics = Array("hello")
      val numPartitionsOfInputTopic = 3
      (1 to numPartitionsOfInputTopic).map { _ =>
        KafkaUtils.createDirectStream[String, String](ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams))
      }
    }

    kafkaStreams.foreach(stream => {
      stream.foreachRDD(rdd => {
        // Capture the offset ranges of this micro-batch before processing it.
        val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
        rdd.foreach(record => {
          try {
            println(record.value().trim)
            CassandraHelper.saveItemEvent(record.value().trim)
          } catch {
            case ex: Exception => println(ex.getMessage)
          }
        })
        // Commit the processed offsets back to Kafka.
        stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
      })
      println("Read Msg")
    })
    println(" Spark parallel reader is ready !!!")
    ssc.start()
    ssc.awaitTermination()
  }

  def getSparkConf(): SparkConf = {
    val conf = new SparkConf(true)
      .setAppName("TestAppl")
      .set("spark.cassandra.connection.host", "vmip")
      .set("spark.streaming.stopGracefullyOnShutdown","true")
      .setMaster("spark://vm-hostname:7077")

    conf
  }

Version

scalaVersion := "2.11.8"
val sparkVersion = "2.2.0"
val connectorVersion = "2.0.7"


libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-hive" % sparkVersion % "provided",
  "com.datastax.spark" %% "spark-cassandra-connector" % connectorVersion,
  "org.apache.kafka" %% "kafka" % "0.10.1.0",
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % sparkVersion,
  "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided"
)
mergeStrategy in assembly := {
  case PathList("org", "apache", "spark", "unused", "UnusedStubClass.class") => MergeStrategy.first
  case x => (mergeStrategy in assembly).value(x)
}

To debug your issue, the first thing to do is to make sure that messages actually go through Kafka. To do so, you need port 9092 open on your VM, and then try consuming directly from Kafka:

bin/kafka-console-consumer.sh --bootstrap-server vmip:9092 --topic hello --from-beginning

The --from-beginning option will consume everything up to the maximum retention time you configured on your Kafka topic.
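
If you prefer to run the same check from code rather than the console consumer, here is a minimal sketch using the plain kafka-clients consumer API (which the dependencies listed in the question already pull in). The broker address vmip:9092, topic hello and group semantics come from the question; the object name KafkaConnectivityCheck and the group id connectivity-check are only illustrative:

import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.serialization.StringDeserializer
import scala.collection.JavaConverters._

// Hypothetical standalone check: if nothing prints here when you publish to the
// topic, the problem is Kafka connectivity from the VM, not Spark.
object KafkaConnectivityCheck {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "vmip:9092")                      // same broker as the streaming job
    props.put("key.deserializer", classOf[StringDeserializer].getName)
    props.put("value.deserializer", classOf[StringDeserializer].getName)
    props.put("group.id", "connectivity-check")                      // separate group, so the "loc" offsets stay untouched
    props.put("auto.offset.reset", "earliest")                       // read from the beginning, like --from-beginning

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("hello"))

    // Poll a few times and print whatever arrives.
    for (_ <- 1 to 10) {
      val records = consumer.poll(1000)
      records.asScala.foreach(r => println(s"partition=${r.partition()} offset=${r.offset()} value=${r.value()}"))
    }
    consumer.close()
  }
}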

Check as well that you don't have two versions of Spark on your VM, in which case you need to use "spark2-submit" to submit a Spark 2 job.
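
One quick way to confirm which Spark version the submitted job actually runs against is to print it from inside the application itself. The sketch below is only illustrative (the object name VersionCheck is made up), but spark.version is the standard SparkSession property; for this build it should print 2.2.0:

import org.apache.spark.sql.SparkSession

// Hypothetical sketch: print the runtime Spark version so a stray second
// Spark installation on the VM is easy to spot in the driver logs.
object VersionCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("VersionCheck")
      .getOrCreate()
    println(s"Spark version: ${spark.version}")
    println(s"Scala library version: ${util.Properties.versionString}")
    spark.stop()
  }
}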
