Is it possible to stream database table data using Spark Streaming?

I am trying to stream SQL Server table data, so I created a simple Java program with a main class. I created a SparkConf and used it to start a JavaStreamingContext, then retrieved the SparkContext from it. Using Spark's JdbcRDD and JavaRDD APIs I read data from the database, loaded it into an inputQueue, and prepared a JavaInputDStream from it. With these prerequisites done I started the JavaStreamingContext. I do get the first set of data that was loaded while preparing the inputQueue, but no data arrives for any further batches.

package com.ApacheSparkConnection.ApacheSparkConnection;

import java.io.Serializable;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.Queue;

import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.rdd.JdbcRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.StreamingContext;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.Tuple2;
import scala.reflect.ClassManifestFactory$;
import scala.runtime.AbstractFunction0;
import scala.runtime.AbstractFunction1;

public class MainSparkConnector {

    public static void main(String[] args) throws Exception {

        String dbtableQuery = "SELECT TOP 10 AGENT_CODE,AGENT_NAME,WORKING_AREA,COMMISSION,PHONE_NO,COUNTRY FROM dbo.AGENTS where AGENT_CODE >= ? and AGENT_CODE <= ?";

        String host = "XXXXXXXXX";
        String databaseName = "YYYY";
        String user = "sa";
        String password = "XXXXXX@123";

        long previewSize = 0; 

        Instant start = Instant.now();

        SparkConf sparkConf = new SparkConf().setAppName("SparkJdbcDs")
                .setMaster("local[4]")
                .set("spark.driver.allowMultipleContexts", "true");

        JavaStreamingContext javaStreamingContext = new JavaStreamingContext(sparkConf, Durations.seconds(10));
        JavaSparkContext javaSparkContext  =  javaStreamingContext.sparkContext();
        SparkContext sparkContext = javaSparkContext.sc(); 

        String url = "jdbc:sqlserver://" + host + ":1433;databaseName=" + databaseName;
        String driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver"; 

        DbConnection dbConnection = new DbConnection(driver, url, user, password);

        // JdbcRDD runs the bounded query once (placeholders bound to 0..100000, split into 10 partitions);
        // it is a one-off batch read, not a continuous stream.
        JdbcRDD<Object[]> jdbcRDD =
                new JdbcRDD<Object[]>(sparkContext, dbConnection, dbtableQuery, 0,
                              100000, 10, new MapResult(), ClassManifestFactory$.MODULE$.fromClass(Object[].class));

        JavaRDD<Object[]> javaRDD = JavaRDD.fromRDD(jdbcRDD, ClassManifestFactory$.MODULE$.fromClass(Object[].class));

        List<String> employeeFullNameList = javaRDD.map(new Function<Object[], String>() {
            @Override
            public String call(final Object[] record) throws Exception {
                String rec = "";
                for(Object ob : record) {
                    rec = rec + " " + ob;
                }
                return rec;
            }
        }).collect();

        // Parallelize the collected rows and place the resulting RDD in the queue backing the stream.
        JavaRDD<String> javaRDD1 = javaStreamingContext.sparkContext().parallelize(employeeFullNameList);
        Queue<JavaRDD<String>> inputQueue = new LinkedList<JavaRDD<String>>();

        inputQueue.add(javaRDD1);

        // queueStream only emits RDDs that are present in the queue; with a single RDD queued,
        // only the first batch carries data and every later batch is empty.
        JavaInputDStream<String> javaDStream = javaStreamingContext.queueStream(inputQueue, true);
        System.out.println("javaDStream.print()");
        javaDStream.print();
        javaDStream.foreachRDD( rdd-> {
            System.out.println("rdd.count() : "+ rdd.count());
            rdd.collect().stream().forEach(n-> System.out.println("item of list: "+n));
        });
        javaStreamingContext.start();

        System.out.println("employeeFullNameList.size() : "+employeeFullNameList.size());

        javaStreamingContext.awaitTermination();
    }

    static class DbConnection extends AbstractFunction0<Connection> implements Serializable {

        private String driverClassName;
        private String connectionUrl;
        private String userName;
        private String password;

        public DbConnection(String driverClassName, String connectionUrl, String userName, String password) {
            this.driverClassName = driverClassName;
            this.connectionUrl = connectionUrl;
            this.userName = userName;
            this.password = password;
        }

        public Connection apply() {
            try {
                Class.forName(driverClassName);
            } catch (ClassNotFoundException e) {
                System.out.println("Failed to load driver class" +e);
            }

            Properties properties = new Properties();
            properties.setProperty("user", userName);
            properties.setProperty("password", password);

            Connection connection = null;
            try {
                connection = DriverManager.getConnection(connectionUrl, properties);
            } catch (SQLException e) {
                System.out.println("Connection failed"+ e);
            }

            return connection;
        }
    }

    static class MapResult extends AbstractFunction1<ResultSet, Object[]> implements Serializable {

        public Object[] apply(ResultSet row) {
            return JdbcRDD.resultSetToObjectArray(row);
        }
    }
}
Please let me know if I am heading in the wrong direction.

Streaming an initial snapshot of RDBMS data through Spark Streaming is easy, but there is no direct way to pick up the trailing changes that happen in the database afterwards.

A better solution is to go with the Debezium SQL Server connector.

Debezium's SQL Server connector can monitor and record the row-level changes in the schemas of a SQL Server database.

  • You will need to set up a Kafka cluster
  • Enable CDC for SQL Server (a minimal sketch of this step follows below)
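As a rough, hedged illustration of the second bullet: CDC can be switched on over plain JDBC by calling SQL Server's standard stored procedures, as in the sketch below. It reuses the AGENTS table and connection details from the question as placeholder assumptions; the exact parameters (capture instance, role, filegroup) depend on your environment.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class EnableCdcSketch {

    public static void main(String[] args) throws Exception {
        // Placeholder connection details mirroring the question's setup.
        String url = "jdbc:sqlserver://XXXXXXXXX:1433;databaseName=YYYY";

        try (Connection connection = DriverManager.getConnection(url, "sa", "XXXXXX@123");
             Statement statement = connection.createStatement()) {

            // Enable CDC at the database level (requires sysadmin).
            statement.execute("EXEC sys.sp_cdc_enable_db");

            // Enable CDC for the dbo.AGENTS table; SQL Server creates a change table for it.
            statement.execute(
                "EXEC sys.sp_cdc_enable_table "
                + "@source_schema = N'dbo', "
                + "@source_name = N'AGENTS', "
                + "@role_name = NULL");
        }
    }
}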

SQL Server CDC is not designed to store a complete history of database changes. It is therefore necessary for Debezium to establish a baseline of the current database content and stream it to Kafka. This is achieved through a process called snapshotting.

By default (snapshot mode 'initial'), on first startup the connector performs an initial consistent snapshot of the database, meaning the structure and data of every table that is to be captured according to the connector's filter configuration.
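For reference, the snapshot mode and table filters mentioned above are part of the connector configuration, which is normally registered by POSTing JSON to the Kafka Connect REST API. The sketch below does that with plain java.net; the host names, logical server name (fulfillment), credentials, and the exact property names (shown here in Debezium 1.x style, e.g. table.whitelist) are assumptions to adapt to your own setup and Debezium version.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class RegisterConnectorSketch {

    public static void main(String[] args) throws Exception {
        // Hypothetical connector configuration; adjust hosts, credentials and table filters.
        String config = "{"
            + "\"name\": \"agents-connector\","
            + "\"config\": {"
            + "  \"connector.class\": \"io.debezium.connector.sqlserver.SqlServerConnector\","
            + "  \"database.hostname\": \"XXXXXXXXX\","
            + "  \"database.port\": \"1433\","
            + "  \"database.user\": \"sa\","
            + "  \"database.password\": \"XXXXXX@123\","
            + "  \"database.dbname\": \"YYYY\","
            + "  \"database.server.name\": \"fulfillment\","
            + "  \"table.whitelist\": \"dbo.AGENTS\","
            + "  \"snapshot.mode\": \"initial\","
            + "  \"database.history.kafka.bootstrap.servers\": \"kafka:9092\","
            + "  \"database.history.kafka.topic\": \"dbhistory.fulfillment\""
            + "}}";

        // POST the configuration to the Kafka Connect REST endpoint (default port 8083).
        HttpURLConnection conn = (HttpURLConnection) new URL("http://connect:8083/connectors").openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(config.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("Kafka Connect responded with HTTP " + conn.getResponseCode());
    }
}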

Each snapshot consists of the following steps:

1. Determine the tables to be captured.

2. Obtain a lock on each of the monitored tables to ensure that no structural changes can occur to any of the tables. The level of the lock is determined by the snapshot.isolation.mode configuration option.

3. Read the maximum LSN ("log sequence number") position in the server's transaction log.

4. Capture the structure of all relevant tables.

5. Optionally release the locks obtained in step 2, i.e. the locks are usually held only for a short period of time.

6. Scan all of the relevant database tables and schemas as valid at the LSN position read in step 3, generate a READ event for each row, and write that event to the appropriate table-specific Kafka topic.

7. Record the successful completion of the snapshot in the connector offsets.

Reading the change data tables

When the connector starts for the first time, it takes a structural snapshot of the captured tables and persists this information in its internal database history topic. The connector then identifies a change table for each source table and executes the main loop:

1. For each change table, read all changes that were created between the last stored maximum LSN and the current maximum LSN.

2. Order the read changes incrementally according to commit LSN and change LSN. This ensures that the changes are replayed by Debezium in the same order as they were made to the database.

3. Pass commit and change LSNs as offsets to Kafka Connect.

4. Store the maximum LSN and repeat the loop.

After a restart, the connector resumes from the offset (commit and change LSNs) at which it previously stopped.

The connector is able to detect at runtime whether CDC has been enabled or disabled for the whitelisted source tables and modify its behavior accordingly.

The SQL Server connector writes events for all insert, update, and delete operations on a single table to a single Kafka topic. The name of the Kafka topic always takes the form serverName.schemaName.tableName, where serverName is the logical name of the connector as specified with the database.server.name configuration property, schemaName is the name of the schema in which the operation occurred, and tableName is the name of the database table on which the operation occurred.

For example, consider a SQL Server installation with an inventory database that contains four tables in the dbo schema: products, products_on_hand, customers, and orders. If the connector monitoring this database were given the logical server name fulfillment, the connector would produce events on these four Kafka topics:

    fulfillment.dbo.products
    fulfillment.dbo.products_on_hand
    fulfillment.dbo.customers
    fulfillment.dbo.orders
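To tie this back to the original question, the change events Debezium writes to those topics can then be consumed from Spark Streaming with the spark-streaming-kafka-0-10 integration, along these lines. This is only a minimal sketch: the bootstrap server, the group id, and the choice of the fulfillment.dbo.products topic are assumptions based on the example above.

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class DebeziumTopicStreamSketch {

    public static void main(String[] args) throws Exception {
        SparkConf sparkConf = new SparkConf().setAppName("DebeziumTopicStream").setMaster("local[4]");
        JavaStreamingContext streamingContext = new JavaStreamingContext(sparkConf, Durations.seconds(10));

        // Assumed Kafka consumer settings; point them at the cluster Debezium writes to.
        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "kafka:9092");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "agents-stream");
        kafkaParams.put("auto.offset.reset", "earliest");

        // Subscribe to one of the Debezium change topics from the example above.
        JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(
                streamingContext,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(Arrays.asList("fulfillment.dbo.products"), kafkaParams));

        // Each record value is a Debezium change event (JSON by default).
        stream.foreachRDD(rdd -> rdd.foreach(record -> System.out.println(record.value())));

        streamingContext.start();
        streamingContext.awaitTermination();
    }
}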
