How to get only values from kafka sources to spark?
I am reading logs from a Kafka source and processing them with Spark.
The logs saved under my hadoop_path look like this:
{"value":"{\\"Name\\":\\"Amy\\",\\"Age\\":\\"22\\"}"}
{"value":"{\\"Name\\":\\"Jin\\",\\"Age\\":\\"26\\"}"}
However, I want them to look like this:
{\\"Name\\":\\"Amy\\",\\"Age\\":\\"22\\"}
{\\"Name\\":\\"Jin\\",\\"Age\\":\\"26\\"}
Any kind of solution would be great (plain Java code, Spark SQL, or Kafka).
SparkSession spark = SparkSession.builder()
        .master("local")
        .appName("MYApp").getOrCreate();

Dataset<Row> df = spark
        .readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", Kafka_source)
        .option("subscribe", Kafka_topic)
        .option("startingOffsets", "earliest")
        .option("failOnDataLoss", false)
        .load();

Dataset<Row> dg = df.selectExpr("CAST(value AS STRING)");

StreamingQuery queryone = dg.writeStream()
        .format("json")
        .outputMode("append")
        .option("checkpointLocation", Hadoop_path)
        .option("path", Hadoop_path)
        .start();
Use the following:
Dataset<Row> df = spark
        .readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", Kafka_source)
        .option("subscribe", Kafka_topic)
        .option("startingOffsets", "earliest")
        .option("failOnDataLoss", false)
        .load();

df.printSchema();

StreamingQuery queryone = df.selectExpr("CAST(value AS STRING)")
        .writeStream()
        .format("json")
        .outputMode("append")
        .option("checkpointLocation", Hadoop_path)
        .option("path", Hadoop_path)
        .start();
Make sure the schema contains value as a column.
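For reference, the Kafka source in Spark Structured Streaming always exposes the same fixed set of columns, so df.printSchema() should print:

root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)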
You can get the expected result with Spark as follows:
SparkSession spark = SparkSession.builder()
        .master("local")
        .appName("MYApp").getOrCreate();

Dataset<Row> df = spark
        .readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", Kafka_source)
        .option("subscribe", Kafka_topic)
        .option("startingOffsets", "earliest")
        .option("failOnDataLoss", false)
        .load();

// Extract each field from the JSON string, then drop the raw value
// column so that only Name and Age end up in the output files.
Dataset<Row> dg = df.selectExpr("CAST(value AS STRING)")
        .withColumn("Name", functions.json_tuple(functions.col("value"), "Name"))
        .withColumn("Age", functions.json_tuple(functions.col("value"), "Age"))
        .drop("value");

StreamingQuery queryone = dg.writeStream()
        .format("json")
        .outputMode("append")
        .option("checkpointLocation", Hadoop_path)
        .option("path", Hadoop_path)
        .start();
Basically, you have to create a separate column for each field inside the JSON string in the value column.
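With the raw value column dropped as in the snippet above, the JSON files written to Hadoop_path should contain records in the desired shape:

{"Name":"Amy","Age":"22"}
{"Name":"Jin","Age":"26"}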
SparkSession spark = SparkSession.builder()
        .master("local")
        .appName("MYApp").getOrCreate();

Dataset<Row> df = spark
        .readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", Kafka_source)
        .option("subscribe", Kafka_topic)
        .option("startingOffsets", "earliest")
        .option("failOnDataLoss", false)
        .load();

Dataset<Row> dg = df.selectExpr("CAST(value AS STRING)");

// Parse each field out of the JSON string. Age is read as a string here
// because it is quoted in the sample data; with IntegerType, from_json
// would return null for values like "22".
Dataset<Row> dz = dg.select(
        functions.from_json(dg.col("value"), DataTypes.createStructType(
                new StructField[] {
                        DataTypes.createStructField("Name", DataTypes.StringType, true)
                })).getField("Name").alias("Name"),
        functions.from_json(dg.col("value"), DataTypes.createStructType(
                new StructField[] {
                        DataTypes.createStructField("Age", DataTypes.StringType, true)
                })).getField("Age").alias("Age"));

StreamingQuery queryone = dz.writeStream()
        .format("json")
        .outputMode("append")
        .option("checkpointLocation", Hadoop_path)
        .option("path", Hadoop_path)
        .start();
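For what it's worth, here is a more compact sketch of the same idea (my variant, not part of the original answer): parse the JSON once with a single schema covering both fields, then flatten the resulting struct. It assumes the same dg dataset and sample data as above.

// Requires: org.apache.spark.sql.functions and org.apache.spark.sql.types.*
StructType schema = new StructType()
        .add("Name", DataTypes.StringType)
        .add("Age", DataTypes.StringType);   // Age is a quoted string in the sample data

Dataset<Row> dz = dg
        .select(functions.from_json(functions.col("value"), schema).alias("data"))
        .selectExpr("data.*");   // flatten struct fields into top-level columns

This avoids calling from_json once per field and keeps the payload schema in one place.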