Is it possible to stream database table data using Spark Streaming?

Trying to stream SQL Server table data. I have created a simple Java program with a main class, created a SparkConf and, using that, initiated a JavaStreamingContext and retrieved the SparkContext from it. Using the JdbcRDD and JavaRDD Spark APIs I received the data from the database, initiated an inputQueue and then prepared a JavaInputDStream. With the prerequisites done, I started the JavaStreamingContext. I am getting the first set of data, which I received while preparing the inputQueue, but I am not getting any data for further streams.

package com.ApacheSparkConnection.ApacheSparkConnection;

import java.io.Serializable;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.time.Instant;
import java.util.LinkedList;
import java.util.List;
import java.util.Properties;
import java.util.Queue;

import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.rdd.JdbcRDD;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.reflect.ClassManifestFactory$;
import scala.runtime.AbstractFunction0;
import scala.runtime.AbstractFunction1;

public class MainSparkConnector {

    public static void main(String[] args) throws Exception {

        String dbtableQuery = "SELECT TOP 10 AGENT_CODE,AGENT_NAME,WORKING_AREA,COMMISSION,PHONE_NO,COUNTRY FROM dbo.AGENTS where AGENT_CODE >= ? and AGENT_CODE <= ?";

        String host = "XXXXXXXXX";
        String databaseName = "YYYY";
        String user = "sa";
        String password = "XXXXXX@123";

        long previewSize = 0; 

        Instant start = Instant.now();

        SparkConf sparkConf = new SparkConf().setAppName("SparkJdbcDs")
                .setMaster("local[4]")
                .set("spark.driver.allowMultipleContexts", "true");

        JavaStreamingContext javaStreamingContext = new JavaStreamingContext(sparkConf, Durations.seconds(10));
        JavaSparkContext javaSparkContext  =  javaStreamingContext.sparkContext();
        SparkContext sparkContext = javaSparkContext.sc(); 

        String url = "jdbc:sqlserver://" + host + ":1433;databaseName=" + databaseName;
        String driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver"; 

        DbConnection dbConnection = new DbConnection(driver, url, user, password);

        JdbcRDD<Object[]> jdbcRDD =
                new JdbcRDD<Object[]>(sparkContext, dbConnection, dbtableQuery, 0,
                              100000, 10, new MapResult(), ClassManifestFactory$.MODULE$.fromClass(Object[].class));

        JavaRDD<Object[]> javaRDD = JavaRDD.fromRDD(jdbcRDD, ClassManifestFactory$.MODULE$.fromClass(Object[].class));

        List<String> employeeFullNameList = javaRDD.map(new Function<Object[], String>() {
            @Override
            public String call(final Object[] record) throws Exception {
                String rec = "";
                for(Object ob : record) {
                    rec = rec + " " + ob;
                }
                return rec;
            }
        }).collect();

        JavaRDD<String> javaRDD1 = javaStreamingContext.sparkContext().parallelize(employeeFullNameList);
        Queue<JavaRDD<String>> inputQueue = new LinkedList<JavaRDD<String>>();

        inputQueue.add(javaRDD1);

        JavaInputDStream<String> javaDStream = javaStreamingContext.queueStream(inputQueue, true);
        System.out.println("javaDStream.print()");
        javaDStream.print();
        javaDStream.foreachRDD( rdd-> {
            System.out.println("rdd.count() : "+ rdd.count());
            rdd.collect().stream().forEach(n-> System.out.println("item of list: "+n));
        });
        javaStreamingContext.start();

        System.out.println("employeeFullNameList.size() : "+employeeFullNameList.size());

        javaStreamingContext.awaitTermination();
    }

    static class DbConnection extends AbstractFunction0<Connection> implements Serializable {

        private String driverClassName;
        private String connectionUrl;
        private String userName;
        private String password;

        public DbConnection(String driverClassName, String connectionUrl, String userName, String password) {
            this.driverClassName = driverClassName;
            this.connectionUrl = connectionUrl;
            this.userName = userName;
            this.password = password;
        }

        public Connection apply() {
            try {
                Class.forName(driverClassName);
            } catch (ClassNotFoundException e) {
                System.out.println("Failed to load driver class" +e);
            }

            Properties properties = new Properties();
            properties.setProperty("user", userName);
            properties.setProperty("password", password);

            Connection connection = null;
            try {
                connection = DriverManager.getConnection(connectionUrl, properties);
            } catch (SQLException e) {
                System.out.println("Connection failed"+ e);
            }

            return connection;
        }
    }

    static class MapResult extends AbstractFunction1<ResultSet, Object[]> implements Serializable {

        public Object[] apply(ResultSet row) {
            return JdbcRDD.resultSetToObjectArray(row);
        }
    }
}
Please let me know if I am going in the wrong direction.

Streaming an RDBMS's initial snapshot of data via Spark Streaming is easy, but there is no direct way to pick up the trailing changes that happen in the database afterwards. In your code the queue passed to queueStream only ever contains the single RDD you added before starting the context, so the first batch carries data and every later batch is empty.

A better solution would be to go via the Debezium SQL Server Connector.

Debezium's SQL Server connector can monitor and record the row-level changes in the schemas of a SQL Server database.

  • You will need to set up a Kafka cluster
  • Enable CDC for SQL Server (a JDBC sketch of the required procedure calls follows this list)
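A minimal sketch of the second step, assuming plain JDBC access with the connection details from the question; the schema/table name (dbo.AGENTS) is taken from the query above and the NULL role is simply the least restrictive choice. SQL Server exposes CDC through the sys.sp_cdc_enable_db and sys.sp_cdc_enable_table stored procedures:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class EnableCdcSketch {
    public static void main(String[] args) throws Exception {
        // Same connection details as in the question.
        String url = "jdbc:sqlserver://XXXXXXXXX:1433;databaseName=YYYY";
        try (Connection conn = DriverManager.getConnection(url, "sa", "XXXXXX@123");
             Statement stmt = conn.createStatement()) {
            // Enable CDC at the database level (needs sysadmin rights).
            stmt.execute("EXEC sys.sp_cdc_enable_db");
            // Enable CDC for the table to be captured; @role_name = NULL skips role-based gating.
            stmt.execute("EXEC sys.sp_cdc_enable_table "
                    + "@source_schema = N'dbo', "
                    + "@source_name = N'AGENTS', "
                    + "@role_name = NULL");
        }
    }
}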

SQL Server CDC is not designed to store the complete history of database changes. Debezium therefore has to establish a baseline of the current database content and stream it to Kafka. This is achieved via a process called snapshotting.

By default (snapshot mode initial), the connector will, upon first startup, perform an initial consistent snapshot of the database (meaning the structure of and data within any tables to be captured, as per the connector's filter configuration). A sample connector registration showing this mode is sketched below.
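For illustration only, a connector could be registered against the Kafka Connect REST API roughly like this; the Connect endpoint (localhost:8083), connector name, broker address and captured table are assumptions, and the property names (database.server.name, table.whitelist, snapshot.mode, database.history.*) follow Debezium 1.x conventions, so check them against the Debezium version you deploy:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterConnectorSketch {
    public static void main(String[] args) throws Exception {
        // Connector configuration; table, logical server name and addresses are illustrative.
        String connectorJson = "{"
            + "\"name\": \"agents-connector\","
            + "\"config\": {"
            + "\"connector.class\": \"io.debezium.connector.sqlserver.SqlServerConnector\","
            + "\"database.hostname\": \"XXXXXXXXX\","
            + "\"database.port\": \"1433\","
            + "\"database.user\": \"sa\","
            + "\"database.password\": \"XXXXXX@123\","
            + "\"database.dbname\": \"YYYY\","
            + "\"database.server.name\": \"fulfillment\","
            + "\"table.whitelist\": \"dbo.AGENTS\","
            + "\"snapshot.mode\": \"initial\","
            + "\"database.history.kafka.bootstrap.servers\": \"localhost:9092\","
            + "\"database.history.kafka.topic\": \"dbhistory.fulfillment\""
            + "}}";
        // POST the configuration to the Kafka Connect REST API to create the connector.
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:8083/connectors"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(connectorJson))
            .build();
        HttpResponse<String> response =
            HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}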

Each snapshot consists of the following steps:

1. Determine the tables to be captured.

2. Obtain a lock on each of the monitored tables to ensure that no structural changes can occur to any of the tables. The level of the lock is determined by the snapshot.isolation.mode configuration option.

3. Read the maximum LSN ("log sequence number") position in the server's transaction log.

4. Capture the structure of all relevant tables.

5. Optionally release the locks obtained in step 2, i.e. the locks are usually held only for a short period of time.

6. Scan all of the relevant database tables and schemas as valid at the LSN position read in step 3, generate a READ event for each row, and write that event to the appropriate table-specific Kafka topic.

7. Record the successful completion of the snapshot in the connector offsets.

Reading the change data tables

Upon first start-up, the connector takes a structural snapshot of the captured tables and persists this information in its internal database history topic. The connector then identifies a change table for each source table and executes the main loop (a JDBC sketch of what such an LSN-windowed read looks like follows the list):

1. For each change table, read all changes that were created between the last stored maximum LSN and the current maximum LSN.

2. Order the read changes incrementally according to commit LSN and change LSN. This ensures that the changes are replayed by Debezium in the same order as they were made to the database.

3. Pass commit and change LSNs as offsets to Kafka Connect.

4. Store the maximum LSN and repeat the loop.
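For intuition only, here is roughly what such an LSN-windowed read looks like against SQL Server's own CDC objects (sys.fn_cdc_get_min_lsn, sys.fn_cdc_get_max_lsn and the generated cdc.fn_cdc_get_all_changes_<capture_instance> function). Debezium does this internally, so you do not write this yourself; the capture-instance name dbo_AGENTS is an assumption based on the table in the question:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CdcWindowReadSketch {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:sqlserver://XXXXXXXXX:1433;databaseName=YYYY";
        try (Connection conn = DriverManager.getConnection(url, "sa", "XXXXXX@123");
             Statement stmt = conn.createStatement();
             // Read every change recorded between the low and high LSN of the capture instance.
             ResultSet rs = stmt.executeQuery(
                 "DECLARE @from_lsn BINARY(10) = sys.fn_cdc_get_min_lsn('dbo_AGENTS'); " +
                 "DECLARE @to_lsn BINARY(10) = sys.fn_cdc_get_max_lsn(); " +
                 "SELECT __$operation, AGENT_CODE, AGENT_NAME " +
                 "FROM cdc.fn_cdc_get_all_changes_dbo_AGENTS(@from_lsn, @to_lsn, 'all')")) {
            while (rs.next()) {
                // __$operation: 1 = delete, 2 = insert, 4 = update (after image).
                System.out.println(rs.getInt("__$operation") + " " + rs.getString("AGENT_NAME"));
            }
        }
    }
}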

After a restart, the connector will resume from the offset (commit and change LSNs) where it left off before.

The connector is able to detect at runtime whether CDC has been enabled or disabled for a whitelisted source table and modify its behavior accordingly.

The SQL Server connector writes events for all insert, update, and delete operations on a single table to a single Kafka topic. The name of the Kafka topics always takes the form serverName.schemaName.tableName, where serverName is the logical name of the connector as specified with the database.server.name configuration property, schemaName is the name of the schema where the operation occurred, and tableName is the name of the database table on which the operation occurred.

For example, consider a SQL Server installation with an inventory database that contains four tables in the dbo schema: products, products_on_hand, customers, and orders. If the connector monitoring this database were given a logical server name of fulfillment, then the connector would produce events on these four Kafka topics (a Spark Streaming consumer sketch for one of them follows the list):

    fulfillment.dbo.products
    fulfillment.dbo.products_on_hand
    fulfillment.dbo.customers
    fulfillment.dbo.orders
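To tie this back to the original question, those topics can then be consumed from Spark Streaming. A minimal sketch using the spark-streaming-kafka-0-10 integration, assuming a broker at localhost:9092 and subscribing to one of the example topics above; with the default JSON converter each record value is a Debezium change event whose payload carries before, after and op fields:

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class DebeziumTopicConsumerSketch {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("DebeziumConsumer").setMaster("local[4]");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        // Kafka consumer settings; broker address and group id are illustrative.
        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "localhost:9092");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "spark-debezium-consumer");
        kafkaParams.put("auto.offset.reset", "earliest");

        // Subscribe to one of the Debezium change topics listed above.
        JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(
                        Arrays.asList("fulfillment.dbo.customers"), kafkaParams));

        // Each record value is one row-level change event (snapshot READ, insert, update or delete).
        stream.foreachRDD(rdd -> rdd.foreach(record -> System.out.println(record.value())));

        jssc.start();
        jssc.awaitTermination();
    }
}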
