Is it possible to stream database table data using Spark Streaming?

Trying to stream SQL Server table data. I have created a simple Java program with a main class, created a SparkConf and, using that, initiated a JavaStreamingContext and retrieved the SparkContext from it. Using the JdbcRDD and JavaRDD Spark APIs I received the data from the database, initiated an inputQueue and then prepared a JavaInputDStream. With the prerequisites done, I started the JavaStreamingContext. I am getting the first set of data, which I received while preparing the inputQueue, but I am not getting any data for further streams.

package com.ApacheSparkConnection.ApacheSparkConnection;

import java.io.Serializable;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.time.Instant;
import java.util.LinkedList;
import java.util.List;
import java.util.Properties;
import java.util.Queue;

import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.rdd.JdbcRDD;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.reflect.ClassManifestFactory$;
import scala.runtime.AbstractFunction0;
import scala.runtime.AbstractFunction1;

public class MainSparkConnector {

    public static void main(String[] args) throws Exception {

        String dbtableQuery = "SELECT TOP 10 AGENT_CODE,AGENT_NAME,WORKING_AREA,COMMISSION,PHONE_NO,COUNTRY FROM dbo.AGENTS where AGENT_CODE >= ? and AGENT_CODE <= ?";

        String host = "XXXXXXXXX";
        String databaseName = "YYYY";
        String user = "sa";
        String password = "XXXXXX@123";

        long previewSize = 0; 

        Instant start = Instant.now();

        SparkConf sparkConf = new SparkConf().setAppName("SparkJdbcDs")
                .setMaster("local[4]")
                .set("spark.driver.allowMultipleContexts", "true");

        JavaStreamingContext javaStreamingContext = new JavaStreamingContext(sparkConf, Durations.seconds(10));
        JavaSparkContext javaSparkContext  =  javaStreamingContext.sparkContext();
        SparkContext sparkContext = javaSparkContext.sc(); 

        String url = "jdbc:sqlserver://" + host + ":1433;databaseName=" + databaseName;
        String driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver"; 

        DbConnection dbConnection = new DbConnection(driver, url, user, password);

        JdbcRDD<Object[]> jdbcRDD =
                new JdbcRDD<Object[]>(sparkContext, dbConnection, dbtableQuery, 0,
                              100000, 10, new MapResult(), ClassManifestFactory$.MODULE$.fromClass(Object[].class));

        JavaRDD<Object[]> javaRDD = JavaRDD.fromRDD(jdbcRDD, ClassManifestFactory$.MODULE$.fromClass(Object[].class));

        List<String> employeeFullNameList = javaRDD.map(new Function<Object[], String>() {
            @Override
            public String call(final Object[] record) throws Exception {
                String rec = "";
                for(Object ob : record) {
                    rec = rec + " " + ob;
                }
                return rec;
            }
        }).collect();

        JavaRDD<String> javaRDD1 = javaStreamingContext.sparkContext().parallelize(employeeFullNameList);
        Queue<JavaRDD<String>> inputQueue = new LinkedList<JavaRDD<String>>();

        inputQueue.add(javaRDD1);

        JavaInputDStream<String> javaDStream = javaStreamingContext.queueStream(inputQueue, true);
        System.out.println("javaDStream.print()");
        javaDStream.print();
        javaDStream.foreachRDD( rdd-> {
            System.out.println("rdd.count() : "+ rdd.count());
            rdd.collect().stream().forEach(n-> System.out.println("item of list: "+n));
        });
        javaStreamingContext.start();

        System.out.println("employeeFullNameList.size() : "+employeeFullNameList.size());

        javaStreamingContext.awaitTermination();
    }

    static class DbConnection extends AbstractFunction0<Connection> implements Serializable {

        private String driverClassName;
        private String connectionUrl;
        private String userName;
        private String password;

        public DbConnection(String driverClassName, String connectionUrl, String userName, String password) {
            this.driverClassName = driverClassName;
            this.connectionUrl = connectionUrl;
            this.userName = userName;
            this.password = password;
        }

        public Connection apply() {
            try {
                Class.forName(driverClassName);
            } catch (ClassNotFoundException e) {
                System.out.println("Failed to load driver class" +e);
            }

            Properties properties = new Properties();
            properties.setProperty("user", userName);
            properties.setProperty("password", password);

            Connection connection = null;
            try {
                connection = DriverManager.getConnection(connectionUrl, properties);
            } catch (SQLException e) {
                System.out.println("Connection failed"+ e);
            }

            return connection;
        }
    }

    static class MapResult extends AbstractFunction1<ResultSet, Object[]> implements Serializable {

        public Object[] apply(ResultSet row) {
            return JdbcRDD.resultSetToObjectArray(row);
        }
    }
}
Please let me know if I am going in the wrong direction.

Streaming an RDBMS's initial snapshot of data via Spark Streaming is easy, but there is no direct way to pick up the trailing changes that happen in the database afterwards. In your code the queue passed to queueStream only ever contains the single RDD you added before starting the context, so the first batch carries data and every later batch is empty.

A better solution would be to go via the Debezium SQL Server Connector.

Debezium's SQL Server connector can monitor and record the row-level changes in the schemas of a SQL Server database.

  • You will need to set up a Kafka cluster
  • Enable CDC for SQL Server (a JDBC sketch of the required procedure calls follows this list)
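A minimal sketch of the second step, assuming plain JDBC access with the connection details from the question; the schema/table name (dbo.AGENTS) is taken from the query above and the NULL role is simply the least restrictive choice. SQL Server exposes CDC through the sys.sp_cdc_enable_db and sys.sp_cdc_enable_table stored procedures:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class EnableCdcSketch {
    public static void main(String[] args) throws Exception {
        // Same connection details as in the question.
        String url = "jdbc:sqlserver://XXXXXXXXX:1433;databaseName=YYYY";
        try (Connection conn = DriverManager.getConnection(url, "sa", "XXXXXX@123");
             Statement stmt = conn.createStatement()) {
            // Enable CDC at the database level (needs sysadmin rights).
            stmt.execute("EXEC sys.sp_cdc_enable_db");
            // Enable CDC for the table to be captured; @role_name = NULL skips role-based gating.
            stmt.execute("EXEC sys.sp_cdc_enable_table "
                    + "@source_schema = N'dbo', "
                    + "@source_name = N'AGENTS', "
                    + "@role_name = NULL");
        }
    }
}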

SQL Server CDC is not designed to store the complete history of database changes. Debezium therefore has to establish a baseline of the current database content and stream it to Kafka. This is achieved via a process called snapshotting.

By default (snapshot mode initial), the connector will, upon first startup, perform an initial consistent snapshot of the database (meaning the structure of and data within any tables to be captured, as per the connector's filter configuration). A sample connector registration showing this mode is sketched below.
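For illustration only, a connector could be registered against the Kafka Connect REST API roughly like this; the Connect endpoint (localhost:8083), connector name, broker address and captured table are assumptions, and the property names (database.server.name, table.whitelist, snapshot.mode, database.history.*) follow Debezium 1.x conventions, so check them against the Debezium version you deploy:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterConnectorSketch {
    public static void main(String[] args) throws Exception {
        // Connector configuration; table, logical server name and addresses are illustrative.
        String connectorJson = "{"
            + "\"name\": \"agents-connector\","
            + "\"config\": {"
            + "\"connector.class\": \"io.debezium.connector.sqlserver.SqlServerConnector\","
            + "\"database.hostname\": \"XXXXXXXXX\","
            + "\"database.port\": \"1433\","
            + "\"database.user\": \"sa\","
            + "\"database.password\": \"XXXXXX@123\","
            + "\"database.dbname\": \"YYYY\","
            + "\"database.server.name\": \"fulfillment\","
            + "\"table.whitelist\": \"dbo.AGENTS\","
            + "\"snapshot.mode\": \"initial\","
            + "\"database.history.kafka.bootstrap.servers\": \"localhost:9092\","
            + "\"database.history.kafka.topic\": \"dbhistory.fulfillment\""
            + "}}";
        // POST the configuration to the Kafka Connect REST API to create the connector.
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:8083/connectors"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(connectorJson))
            .build();
        HttpResponse<String> response =
            HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}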

Each snapshot consists of the following steps:

1. Determine the tables to be captured.

2. Obtain a lock on each of the monitored tables to ensure that no structural changes can occur to any of the tables. The level of the lock is determined by the snapshot.isolation.mode configuration option.

3. Read the maximum LSN ("log sequence number") position in the server's transaction log.

4. Capture the structure of all relevant tables.

5. Optionally release the locks obtained in step 2, i.e. the locks are usually held only for a short period of time.

6. Scan all of the relevant database tables and schemas as valid at the LSN position read in step 3, generate a READ event for each row, and write that event to the appropriate table-specific Kafka topic.

7. Record the successful completion of the snapshot in the connector offsets.

Reading the change data tables

Upon first start-up, the connector takes a structural snapshot of the captured tables and persists this information in its internal database history topic. The connector then identifies a change table for each source table and executes the main loop (a JDBC sketch of what such an LSN-windowed read looks like follows the list):

1. For each change table, read all changes that were created between the last stored maximum LSN and the current maximum LSN.

2. Order the read changes incrementally according to commit LSN and change LSN. This ensures that the changes are replayed by Debezium in the same order as they were made to the database.

3. Pass commit and change LSNs as offsets to Kafka Connect.

4. Store the maximum LSN and repeat the loop.
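For intuition only, here is roughly what such an LSN-windowed read looks like against SQL Server's own CDC objects (sys.fn_cdc_get_min_lsn, sys.fn_cdc_get_max_lsn and the generated cdc.fn_cdc_get_all_changes_<capture_instance> function). Debezium does this internally, so you do not write this yourself; the capture-instance name dbo_AGENTS is an assumption based on the table in the question:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CdcWindowReadSketch {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:sqlserver://XXXXXXXXX:1433;databaseName=YYYY";
        try (Connection conn = DriverManager.getConnection(url, "sa", "XXXXXX@123");
             Statement stmt = conn.createStatement();
             // Read every change recorded between the low and high LSN of the capture instance.
             ResultSet rs = stmt.executeQuery(
                 "DECLARE @from_lsn BINARY(10) = sys.fn_cdc_get_min_lsn('dbo_AGENTS'); " +
                 "DECLARE @to_lsn BINARY(10) = sys.fn_cdc_get_max_lsn(); " +
                 "SELECT __$operation, AGENT_CODE, AGENT_NAME " +
                 "FROM cdc.fn_cdc_get_all_changes_dbo_AGENTS(@from_lsn, @to_lsn, 'all')")) {
            while (rs.next()) {
                // __$operation: 1 = delete, 2 = insert, 4 = update (after image).
                System.out.println(rs.getInt("__$operation") + " " + rs.getString("AGENT_NAME"));
            }
        }
    }
}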

After a restart, the connector will resume from the offset (commit and change LSNs) where it left off before.

The connector is able to detect at runtime whether CDC has been enabled or disabled for a whitelisted source table and modify its behavior accordingly.

The SQL Server connector writes events for all insert, update, and delete operations on a single table to a single Kafka topic. The name of the Kafka topics always takes the form serverName.schemaName.tableName, where serverName is the logical name of the connector as specified with the database.server.name configuration property, schemaName is the name of the schema where the operation occurred, and tableName is the name of the database table on which the operation occurred.

For example, consider a SQL Server installation with an inventory database that contains four tables in the dbo schema: products, products_on_hand, customers, and orders. If the connector monitoring this database were given a logical server name of fulfillment, then the connector would produce events on these four Kafka topics (a Spark Streaming consumer sketch for one of them follows the list):

    fulfillment.dbo.products
    fulfillment.dbo.products_on_hand
    fulfillment.dbo.customers
    fulfillment.dbo.orders
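To tie this back to the original question, those topics can then be consumed from Spark Streaming. A minimal sketch using the spark-streaming-kafka-0-10 integration, assuming a broker at localhost:9092 and subscribing to one of the example topics above; with the default JSON converter each record value is a Debezium change event whose payload carries before, after and op fields:

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class DebeziumTopicConsumerSketch {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("DebeziumConsumer").setMaster("local[4]");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        // Kafka consumer settings; broker address and group id are illustrative.
        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "localhost:9092");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "spark-debezium-consumer");
        kafkaParams.put("auto.offset.reset", "earliest");

        // Subscribe to one of the Debezium change topics listed above.
        JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(
                        Arrays.asList("fulfillment.dbo.customers"), kafkaParams));

        // Each record value is one row-level change event (snapshot READ, insert, update or delete).
        stream.foreachRDD(rdd -> rdd.foreach(record -> System.out.println(record.value())));

        jssc.start();
        jssc.awaitTermination();
    }
}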
