
Is the rdd.contains function in spark-scala expensive?

I am getting millions of messages from a Kafka stream in Spark Streaming. There are 15 different types of messages. The messages come from a single topic, and I can only differentiate a message by its content, so I am using the rdd.contains method to get a separate RDD for each type.

Sample messages:

{"a":"foo", "b":"bar","type":"first" .......} {“ a”:“ foo”,“ b”:“ bar”,“ type”:“ first” .......}
{"a":"foo1", "b":"bar1","type":"second" .......} {“ a”:“ foo1”,“ b”:“ bar1”,“ type”:“ second” .......}
{"a":"foo2", "b":"bar2","type":"third" .......} {“ a”:“ foo2”,“ b”:“ bar2”,“ type”:“ third” .......}
{"a":"foo", "b":"bar","type":"first" .......} {“ a”:“ foo”,“ b”:“ bar”,“ type”:“ first” .......}
.............. ..............
............... ...............
......... .........
so on 依此类推

code

DStream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    val rdd_first = rdd.filter {
      ele => ele.contains("First")
    }
    if (!rdd_first.isEmpty()) {
      insertIntoTableFirst(hivecontext.read.json(rdd_first))
    }
    val rdd_second = rdd.filter {
      ele => ele.contains("Second")
    }
    if (!rdd_second.isEmpty()) {
      insertIntoTableSecond(hivecontext.read.json(rdd_second))
    }
    // ... the same way for all 15 different types of RDD
  }
}

Is there any way to get different RDDs from the Kafka topic messages?

There's no rdd.contains. The contains function used here is applied to the Strings in the RDD.

Like here:

val rdd_first = rdd.filter {
  element => element.contains("First") // each `element` is a String 
}

This method is not robust, because other content in the String might also match the comparison, resulting in misclassified records. For example:

{"a":"foo", "b":"bar","type":"second", "c": "first", .......}

One way to deal with this would be to first transform the JSON data into proper records, and then apply grouping or filtering logic on those records. For that, we first need a schema definition of the data. With the schema, we can parse the JSON into records and apply any processing on top of them:

case class Record(a:String, b:String, `type`:String)

import org.apache.spark.sql.types._

val schema = StructType(
  Array(
    StructField("a", StringType, true),
    StructField("b", StringType, true),
    StructField("type", StringType, true)
  )
)
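
As a side note, if the Record case class already mirrors the JSON layout, the same schema can be derived from it instead of being written out field by field; a minimal sketch, assuming the Record class defined above:

import org.apache.spark.sql.Encoders

// Derive the StructType for Record from the case class itself.
val schemaFromCaseClass = Encoders.product[Record].schema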

val processPerType: Map[String, Dataset[Record] => Unit] = Map(...)
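
The question does not say what should happen to each type, so the contents of that Map are left open (Map(...)). Purely as an illustration, with made-up table names and write logic standing in for the real insertIntoTableFirst / insertIntoTableSecond calls, each entry pairs a type value with the action to run on records of that type:

import org.apache.spark.sql.Dataset

// Hypothetical handlers; replace the bodies with the real per-type logic.
val processPerType: Map[String, Dataset[Record] => Unit] = Map(
  "first"  -> ((ds: Dataset[Record]) => ds.write.mode("append").saveAsTable("table_first")),
  "second" -> ((ds: Dataset[Record]) => ds.write.mode("append").saveAsTable("table_second"))
  // ... one entry for each of the 15 message types
)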

import org.apache.spark.sql.functions.from_json
import spark.implicits._  // assuming a SparkSession `spark` is in scope; gives toDF, $"..." and the Record encoder

stream.foreachRDD { rdd =>
  val records = rdd.toDF("value")
    .select(from_json($"value", schema) as "data")  // parse the raw JSON string
    .select("data.*")                               // flatten the struct into a, b, `type` columns
    .as[Record]
  processPerType.foreach { case (tpe, process) =>
    val target = records.filter(entry => entry.`type` == tpe)
    process(target)
  }
}

The question does not specify what kind of logic needs to be applied to each type of record. What's presented here is a generic way of approaching the problem, where any custom logic can be expressed as a function Dataset[Record] => Unit.

If the logic can be expressed as an aggregation, the Dataset aggregation functions will probably be more appropriate.
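
For instance, inside the foreachRDD block above, where records is in scope, a minimal sketch of such an aggregation (counting messages per type is only an illustrative choice, not something the question asks for):

// Count how many records of each type arrived in the current micro-batch.
val countsPerType = records.groupBy($"type").count()
countsPerType.show()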
