[英]How to stream data from Kafka topic to Delta table using Spark Structured Streaming
[英]Error: Using Spark Structured Streaming to read and write data to another topic in kafka
我正在做一个使用 kafka 主题读取 access_logs 文件的小任务,然后我计算状态并将状态计数发送到另一个 kafka 主题。 但我不断收到错误,虽然我没有使用 output 模式或 append 模式:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark;;
使用完整模式时:
Exception in thread "main" org.apache.spark.sql.streaming.StreamingQueryException: requirement failed: KafkaTable does not support Complete mode.
这是我的代码: structuredStreaming.scala
package com.spark.sparkstreaming
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.sql._
import org.apache.log4j._
import org.apache.spark.sql.functions._
import java.util.regex.Pattern
import java.util.regex.Matcher
import java.text.SimpleDateFormat
import java.util.Locale
import Utilities._
object structuredStreaming {
case class LogEntry(ip:String, client:String, user:String, dateTime:String, request:String, status:String, bytes:String, referer:String, agent:String)
val logPattern = apacheLogPattern()
val datePattern = Pattern.compile("\\[(.*?) .+]")
def parseDateField(field: String): Option[String] = {
val dateMatcher = datePattern.matcher(field)
if (dateMatcher.find) {
val dateString = dateMatcher.group(1)
val dateFormat = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss", Locale.ENGLISH)
val date = (dateFormat.parse(dateString))
val timestamp = new java.sql.Timestamp(date.getTime());
return Option(timestamp.toString())
} else {
None
}
}
def parseLog(x:Row) : Option[LogEntry] = {
val matcher:Matcher = logPattern.matcher(x.getString(0));
if (matcher.matches()) {
val timeString = matcher.group(4)
return Some(LogEntry(
matcher.group(1),
matcher.group(2),
matcher.group(3),
parseDateField(matcher.group(4)).getOrElse(""),
matcher.group(5),
matcher.group(6),
matcher.group(7),
matcher.group(8),
matcher.group(9)
))
} else {
return None
}
}
def main(args: Array[String]) {
val spark = SparkSession
.builder
.appName("StructuredStreaming")
.master("local[*]")
.config("spark.sql.streaming.checkpointLocation", "/home/UDHAV.MAHATA/Documents/Checkpoints")
.getOrCreate()
setupLogging()
// val rawData = spark.readStream.text("/home/UDHAV.MAHATA/Documents/Spark/logs")
val rawData = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "testing")
.load()
import spark.implicits._
val structuredData = rawData.flatMap(parseLog).select("status")
val windowed = structuredData.groupBy($"status").count()
//val query = windowed.writeStream.outputMode("complete").format("console").start()
val query = windowed
.writeStream
.outputMode("complete")
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("topic", "sink")
.start()
query.awaitTermination()
spark.stop()
}
}
实用程序.scala
package com.spark.sparkstreaming
import org.apache.log4j.Level
import java.util.regex.Pattern
import java.util.regex.Matcher
object Utilities {
def setupLogging() = {
import org.apache.log4j.{Level, Logger}
val rootLogger = Logger.getRootLogger()
rootLogger.setLevel(Level.ERROR)
}
def apacheLogPattern():Pattern = {
val ddd = "\\d{1,3}"
val ip = s"($ddd\\.$ddd\\.$ddd\\.$ddd)?"
val client = "(\\S+)"
val user = "(\\S+)"
val dateTime = "(\\[.+?\\])"
val request = "\"(.*?)\""
val status = "(\\d{3})"
val bytes = "(\\S+)"
val referer = "\"(.*?)\""
val agent = "\"(.*?)\""
val regex = s"$ip $client $user $dateTime $request $status $bytes $referer $agent"
Pattern.compile(regex)
}
}
谁能帮我解决我在哪里做错了?
正如错误消息所暗示的那样,您需要在分组中添加水印。
替换此行
val windowed = structuredData.groupBy($"status").count()
和
import org.apache.spark.sql.functions.{window, col}
val windowed = structuredData.groupBy(window(col("dateTime"), "10 minutes"), "status").count()
如果我正确理解了您的代码,重要的是dateTime
列的类型是timestamp
,无论如何您都可以从 Kafka 源中解析它。
如果没有 window,Spark 将不知道要聚合多少数据。
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.