reduceByKey / aggregateByKey alternative for DStream[Class] in Spark Streaming

Question

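There is already a similar question here, but it uses Maven and I am using sbt. Moreover, none of the solutions there worked for me.

I am using Spark 2.4.0, Scala 2.11.12 and IntelliJ IDEA 2019.1.

My build.sbt looks like this: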

libraryDependencies ++= Seq(
    "com.groupon.sparklint" %% "sparklint-spark212" % "1.0.12" excludeAll ExclusionRule(organization = "org.apache.spark"),
    "org.apache.spark" %% "spark-core" % "2.4.0",
    "org.apache.spark" %% "spark-sql" % "2.4.0",
    "org.apache.spark" %% "spark-streaming" % "2.4.0",
    "org.apache.spark" %% "spark-streaming-kafka" % "1.6.2",
    "com.datastax.spark" %% "spark-cassandra-connector" % "2.4.0",
    "com.typesafe.slick" %% "slick" % "3.3.0",
    "org.slf4j" % "slf4j-nop" % "1.6.4",
    "com.typesafe.slick" %% "slick-hikaricp" % "3.3.0",
    "com.typesafe.slick" %% "slick-extensions" % "3.0.0"
)

EDIT:

I will be receiving a stream of data from Kafka, which is passed to the Spark Streaming context using:

val rawWeatherStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)

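From this, I want to create a stream of RawWeatherData objects. A sample output from the stream looks like this:

(null,725030:14732,2008,12,31,11,0.6,-6.7,1001.7,80,6.2,8,0.0,0.0)

Everything looks fine, except that I need to drop the leading null value before creating the stream of RawWeatherData objects: the constructor cannot accept that first null value, but it can accept all the other values from the stream.

Just for clarity, this is what RawWeatherData looks like (I cannot edit it):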

case class RawWeatherData(
                           wsid: String,
                           year: Int,
                           month: Int,
                           day: Int,
                           hour: Int,
                           temperature: Double,
                           dewpoint: Double,
                           pressure: Double,
                           windDirection: Int,
                           windSpeed: Double,
                           skyCondition: Int,
                           skyConditionText: String,
                           oneHourPrecip: Double,
                           sixHourPrecip: Double) extends WeatherModel

To achieve that, I pass the stream to a function that returns the desired stream of RawWeatherData objects:

def ingestStream(rawWeatherStream: InputDStream[(String, String)]): DStream[RawWeatherData] = {
    rawWeatherStream.map(_._2.split(",")).map(RawWeatherData(_))
}

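Now I want to insert this stream into a MySQL/DB2 database. In this RawWeatherData object (725030:14732,2008,12,31,11,0.6,-6.7,1001.7,80,6.2,8,0.0,0.0), the highlighted part on the left (wsid, year, month, day) is the primary key, and the highlighted part on the right is the value that has to be reduced/aggregated.

So essentially, I want my DStream to hold key-value pairs of the form ([725030:14732,2008,12,31], <summed up values for the key>).

So after ingestStream, I tried the following: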

parsedWeatherStream.map { weather =>
        (weather.wsid, weather.year, weather.month, weather.day, weather.oneHourPrecip)
    }.saveToCassandra(CassandraKeyspace, CassandraTableDailyPrecip)

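After the map, I tried to call .reduceByKey(), but when I do, the error says "Cannot resolve symbol reduceByKey". I am not sure why this happens, since the function is listed in the Spark documentation.

PS. Right now weather.oneHourPrecip is set up as a counter column in Cassandra, so Cassandra sums the value for me automatically. That is not possible in other databases such as DB2, so I want a suitable replacement, something like Spark's reduceByKey. Is there any way to handle this case?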

Answer 1

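Your stream has type DStream[RawWeatherData], and reduceByKey is only available on streams of type DStream[(K, V)], that is, streams of tuples consisting of a key and a value.

What you probably want to do is use mapValues instead of map: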
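A minimal sketch of reshaping the stream into pairs, assuming the key is (wsid, year, month, day) and the value to sum is oneHourPrecip, as described in the question. The names DailyKey, toKeyedPrecip and dailyPrecipPerBatch are made up for this illustration, and the case class is a trimmed stand-in for the real RawWeatherData. Note that reduceByKey on a DStream combines values within each batch only.

```scala
import org.apache.spark.streaming.dstream.DStream

// Trimmed stand-in for the question's RawWeatherData (only the fields used here).
case class RawWeatherData(wsid: String, year: Int, month: Int, day: Int, oneHourPrecip: Double)

object DailyPrecip {
  // Hypothetical alias for the composite key described in the question.
  type DailyKey = (String, Int, Int, Int)

  // Reshape a record into a Tuple2: the composite key first, the value to sum second.
  def toKeyedPrecip(w: RawWeatherData): (DailyKey, Double) =
    ((w.wsid, w.year, w.month, w.day), w.oneHourPrecip)

  // The element type is now a Tuple2, so the pair operations such as reduceByKey resolve.
  // The sum is computed per key within each batch.
  def dailyPrecipPerBatch(parsed: DStream[RawWeatherData]): DStream[(DailyKey, Double)] =
    parsed.map(toKeyedPrecip).reduceByKey(_ + _)
}
```

A five-element tuple such as (wsid, year, month, day, oneHourPrecip) is not a key-value pair, which is why reduceByKey did not resolve after the map in the question; nesting the first four fields into one tuple makes it one.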

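 val parsedWeatherStream: DStream[(String, RawWeatherData)] = rawWeatherStream
     .mapValues(_.split(","))       // keep the key, split only the message value
     .mapValues(RawWeatherData(_))  // build the object from the split fields, as in the question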

As you can see from the type of parsedWeatherStream in the snippet above, if you use mapValues you do not discard the key, and you can then use reduceByKey.
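If the totals need to keep accumulating across batches (the role the Cassandra counter column plays in the question), a stateful operation is one option. The sketch below uses updateStateByKey, which requires checkpointing to be enabled on the StreamingContext. The names RunningPrecip, addToTotal and runningTotals are illustrative, and the input is assumed to be a stream of ((wsid, year, month, day), oneHourPrecip) pairs.

```scala
import org.apache.spark.streaming.dstream.DStream

object RunningPrecip {
  // Hypothetical alias for the composite key (wsid, year, month, day).
  type DailyKey = (String, Int, Int, Int)

  // Add this batch's values for a key to the total carried over from earlier batches.
  def addToTotal(newValues: Seq[Double], previous: Option[Double]): Option[Double] =
    Some(previous.getOrElse(0.0) + newValues.sum)

  // Assumes ssc.checkpoint(...) was called with some directory before the context starts.
  def runningTotals(keyed: DStream[(DailyKey, Double)]): DStream[(DailyKey, Double)] =
    keyed.updateStateByKey[Double](addToTotal _)
}
```

Each batch then emits the current total for every key seen so far, which could be written to the target database as an upsert rather than an increment.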

Solution 1: accepted, score 0, 2019-04-28 18:23:02
