
Commit message in Spark Structured Streaming

I'm using Spark Structured Streaming (2.3) with Kafka 2.4.

I want to know how I can use the async and sync offset-commit properties.

If I set `enable.auto.commit` to true, is the commit sync or async?

How can I define a commit callback in Spark Structured Streaming? Or how can I commit offsets synchronously or asynchronously in Spark Structured Streaming?

Thanks in advance.

My code:

package sparkProject;


import java.io.StringReader;
import java.util.*;

import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder;
import org.apache.spark.sql.catalyst.encoders.RowEncoder;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.StreamingQueryException;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;

public class XMLSparkStreamEntry {

    static StructType structType = new StructType();

    static {
        structType = structType.add("FirstName", DataTypes.StringType, false);
        structType = structType.add("LastName", DataTypes.StringType, false);
        structType = structType.add("Title", DataTypes.StringType, false);
        structType = structType.add("ID", DataTypes.StringType, false);
        structType = structType.add("Division", DataTypes.StringType, false);
        structType = structType.add("Supervisor", DataTypes.StringType, false);

    }

    static ExpressionEncoder<Row> encoder = RowEncoder.apply(structType);

    public static void main(String[] args) throws StreamingQueryException {

        SparkConf conf = new SparkConf();
        SparkSession spark = SparkSession.builder().config(conf).appName("Spark Program").master("local[*]")
                .getOrCreate();

        Dataset<Row> ds1 = spark.readStream().format("kafka").option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "Kafkademo").load();

        Dataset<Row> ss = ds1.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)");

        Dataset<Row> finalOP = ss.flatMap(new FlatMapFunction<Row, Row>() {

            private static final long serialVersionUID = 1L;

            @Override
            public Iterator<Row> call(Row t) throws Exception {

                JAXBContext jaxbContext = JAXBContext.newInstance(FileWrapper.class);
                Unmarshaller unmarshaller = jaxbContext.createUnmarshaller();

                StringReader reader = new StringReader(t.getAs("value"));
                FileWrapper person = (FileWrapper) unmarshaller.unmarshal(reader);

                List<Employee> emp = new ArrayList<Employee>(person.getEmployees());
                List<Row> rows = new ArrayList<Row>();
                for (Employee e : emp) {

                    rows.add(RowFactory.create(e.getFirstname(), e.getLastname(), e.getTitle(), e.getId(),
                            e.getDivision(), e.getSupervisor()));

                }
                return rows.iterator();
            }
        }, encoder);


        Dataset<Row> wordCounts = finalOP.groupBy("firstname").count();

        StreamingQuery query = wordCounts.writeStream().outputMode("complete").format("console").start();
        System.out.println("SHOW SCHEMA");
        query.awaitTermination();

    }

}

Can anyone please check where and how I can implement async and sync offset commits in my code above?

Thanks in advance!

Spark Structured Streaming doesn't support Kafka's commit-offset feature. The option suggested in the official docs is to enable checkpointing:
https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
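Applied to the code in the question, enabling checkpointing only requires adding a `checkpointLocation` option to the sink. A minimal sketch (the path `/tmp/kafka-demo-checkpoint` is an arbitrary example; use a reliable location such as HDFS in production):

```java
// Same console sink as in the question, but with a checkpoint location so
// Structured Streaming can persist and recover Kafka offsets on its own.
StreamingQuery query = wordCounts.writeStream()
        .outputMode("complete")
        .format("console")
        .option("checkpointLocation", "/tmp/kafka-demo-checkpoint")
        .start();
query.awaitTermination();
```

On restart, the query resumes from the offsets recorded in the checkpoint directory rather than from any offsets committed to Kafka.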

Another suggestion is to switch to Spark Streaming (DStreams), which supports Kafka's `commitAsync` API:
https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
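Following the pattern in that integration guide, a DStream-based sketch with manual asynchronous commits might look like this (topic, group id, and broker address are placeholders taken from the question's setup):

```java
import java.util.*;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.*;
import org.apache.spark.streaming.kafka010.*;

public class DStreamCommitExample {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("DStream commitAsync").setMaster("local[*]");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "localhost:9092");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "spark-demo-group");   // example group id
        kafkaParams.put("enable.auto.commit", false);      // commit manually instead

        JavaInputDStream<ConsumerRecord<String, String>> stream =
                KafkaUtils.createDirectStream(jssc,
                        LocationStrategies.PreferConsistent(),
                        ConsumerStrategies.<String, String>Subscribe(
                                Collections.singletonList("Kafkademo"), kafkaParams));

        stream.foreachRDD(rdd -> {
            // Capture the offset ranges before any transformation.
            OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
            // ... process rdd here ...
            // Commit after processing; note this API only offers commitAsync,
            // there is no synchronous commit variant.
            ((CanCommitOffsets) stream.inputDStream()).commitAsync(offsetRanges);
        });

        jssc.start();
        jssc.awaitTermination();
    }
}
```

The commit happens after the batch is processed, so this gives at-least-once semantics: on failure some records may be reprocessed.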

Please read https://www.waitingforcode.com/apache-spark-structured-streaming/apache-spark-structured-streaming-apache-kafka-offsets-management/read — this is an excellent source, although it takes a little reading between the lines.

In short:

Structured Streaming ignores offset commits in Apache Kafka. Instead, it relies on its own offset management on the driver side, which is responsible for distributing offsets to executors and for checkpointing them at the end of each processing round (epoch or micro-batch).
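That said, you can observe the offsets each micro-batch processed via a `StreamingQueryListener`; some applications use this hook to commit offsets back to Kafka themselves, e.g. for external monitoring. A sketch (not part of the original answer; `spark` is the `SparkSession` from the question's code):

```java
import org.apache.spark.sql.streaming.SourceProgress;
import org.apache.spark.sql.streaming.StreamingQueryListener;

// Register before starting the query. onQueryProgress fires once per
// completed micro-batch; endOffset() is a JSON string describing the
// last offsets read from each source, e.g. {"Kafkademo":{"0":42}}.
spark.streams().addListener(new StreamingQueryListener() {
    @Override
    public void onQueryStarted(QueryStartedEvent event) { }

    @Override
    public void onQueryProgress(QueryProgressEvent event) {
        for (SourceProgress source : event.progress().sources()) {
            System.out.println("Processed up to: " + source.endOffset());
            // A manual KafkaConsumer.commitSync(...) could be issued here
            // if committed offsets are needed in Kafka for monitoring.
        }
    }

    @Override
    public void onQueryTerminated(QueryTerminatedEvent event) { }
});
```

Note these listener callbacks are purely observational: Structured Streaming's own recovery still uses the checkpoint, not anything you commit to Kafka.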

Batch Spark and Kafka integration works differently again.
