
How to get only values from Kafka sources to Spark?

I am getting logs from a Kafka source into Spark.
The logs saved at my hadoop_path look like this:

{"value":"{\"Name\":\"Amy\",\"Age\":\"22\"}"}
{"value":"{\"Name\":\"Jin\",\"Age\":\"26\"}"}

However, I want them to look like this:

{\"Name\":\"Amy\",\"Age\":\"22\"}
{\"Name\":\"Jin\",\"Age\":\"26\"}

Any kind of solution would be great (pure Java code, Spark SQL, or Kafka).

        SparkSession spark = SparkSession.builder()
                .master("local")
                .appName("MYApp").getOrCreate();
        // Read the stream from Kafka; each record arrives with a binary "value" column.
        Dataset<Row> df = spark
                .readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", Kafka_source)
                .option("subscribe", Kafka_topic)
                .option("startingOffsets", "earliest")
                .option("failOnDataLoss", false)
                .load();
        // Casting to string still leaves the payload wrapped in the "value" field.
        Dataset<Row> dg = df.selectExpr("CAST(value AS STRING)");
        StreamingQuery queryone = dg.writeStream()
                .format("json")
                .outputMode("append")
                .option("checkpointLocation", Hadoop_path)
                .option("path", Hadoop_path)
                .start();

Use the following:

Dataset<Row> df = spark
                .readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", Kafka_source)
                .option("subscribe", Kafka_topic)
                .option("startingOffsets", "earliest")
                .option("failOnDataLoss",false)
                .load();
df.printSchema();
StreamingQuery queryone = df.selectExpr("CAST(value AS STRING)")
            .writeStream()
            .format("json")
            .outputMode("append")
            .option("checkpointLocation",Hadoop_path)
            .option("path",Hadoop_path)
            .start();

Make sure the schema contains value as a column.
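For reference, the printSchema() call should show the fixed set of columns that the Kafka source exposes, roughly like this (nullability may differ by Spark version):

    root
     |-- key: binary (nullable = true)
     |-- value: binary (nullable = true)
     |-- topic: string (nullable = true)
     |-- partition: integer (nullable = true)
     |-- offset: long (nullable = true)
     |-- timestamp: timestamp (nullable = true)
     |-- timestampType: integer (nullable = true)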

You can get the expected output with Spark as follows:

SparkSession spark = SparkSession.builder()
                .master("local")
                .appName("MYApp").getOrCreate();

Dataset<Row> df = spark
                .readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", Kafka_source)
                .option("subscribe", Kafka_topic)
                .option("startingOffsets", "earliest")
                .option("failOnDataLoss",false)
                .load();

Dataset<Row> dg = df.selectExpr("CAST(value AS STRING)")
        // get_json_object extracts a single field from a JSON string by JSONPath;
        // json_tuple is a generator, which does not compose well per-column in withColumn.
        .withColumn("Name", functions.get_json_object(functions.col("value"), "$.Name"))
        .withColumn("Age", functions.get_json_object(functions.col("value"), "$.Age"));

StreamingQuery queryone = dg.writeStream()
                .format("json")
                .outputMode("append")
                .option("checkpointLocation",Hadoop_path)
                .option("path",Hadoop_path)
                .start();

Basically, you have to create a separate column for each field inside the JSON string in the value column.
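Note that dg above still carries the raw value column, so the JSON files written by the query will contain value alongside Name and Age. If the goal is files holding only the extracted fields, a minimal tweak (sketch; the variable name is just illustrative) is to drop it before writing:

// Keep only the parsed fields in the output files.
Dataset<Row> parsed = dg.drop("value");

Then write parsed instead of dg in the writeStream() call.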

I got it done with the from_json function!

        // Assumes: import static org.apache.spark.sql.functions.from_json;
        //          import static org.apache.spark.sql.types.DataTypes.StringType;
        //          import static org.apache.spark.sql.types.DataTypes.IntegerType;
        SparkSession spark = SparkSession.builder()
                .master("local")
                .appName("MYApp").getOrCreate();
        Dataset<Row> df = spark
                .readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", Kafka_source)
                .option("subscribe", Kafka_topic)
                .option("startingOffsets", "earliest")
                .option("failOnDataLoss", false)
                .load();
        Dataset<Row> dg = df.selectExpr("CAST(value AS STRING)");
        // Parse each field out of the JSON payload and surface it as its own column.
        Dataset<Row> dz = dg.select(
                from_json(dg.col("value"), DataTypes.createStructType(
                        new StructField[] {
                                DataTypes.createStructField("Name", StringType, true)
                        })).getField("Name").alias("Name"),
                from_json(dg.col("value"), DataTypes.createStructType(
                        new StructField[] {
                                DataTypes.createStructField("Age", IntegerType, true)
                        })).getField("Age").alias("Age"));
        // Write dz (the parsed columns), not dg, so only Name and Age reach the sink.
        StreamingQuery queryone = dz.writeStream()
                .format("json")
                .outputMode("append")
                .option("checkpointLocation", Hadoop_path)
                .option("path", Hadoop_path)
                .start();
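One detail the snippet leaves implicit: start() returns immediately, so a standalone Java driver should block on the query, otherwise main() can exit before any micro-batch runs. A minimal addition:

        // Blocks until the query stops; throws StreamingQueryException on failure.
        queryone.awaitTermination();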
