
Spark Streaming With JDBC Source and Redis Stream

I'm trying to combine a few technologies to implement a solution at work. Since I'm new to most of them, I sometimes get stuck, but I've been able to solve some of the problems I've run into. Right now both applications run on Spark, but I can't figure out why the streaming part isn't working.

Maybe it's the way Redis implements its sink on the stream-writing side, or maybe it's the way I'm trying to do the job. Almost all of the streaming examples I've found are based on Spark's own samples, like streaming text or TCP sources, and the only solution I've found for relational databases is based on Kafka Connect, which I can't use right now because the company doesn't have the Oracle CDC option for Kafka.

My scenario is as follows: build an Oracle -> Redis Stream -> MongoDB Spark application.

I've built my code based on the spark-redis examples and used the sample code to try to implement a solution for my case. I load the Oracle data day by day and send it to a Redis stream, from which it will later be extracted and saved to Mongo. Right now the sample below just tries to consume from the stream and show the data on the console, but nothing is shown.

The little 'trick' I tried was to create a CSV directory, read from it as a stream, grab the date from the CSV, use it to query the Oracle DB, and then save the resulting Oracle DataFrame to Redis inside a foreachBatch call. The data is saved, but I think not in the right way, because when I use the sample code to read the stream, nothing is received.

Here is the code:

** Writing to Stream **

// Imports needed by this snippet (log4j, Spark SQL, Joda-Time)
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType
import org.joda.time.{DateTime, LocalDate}
import org.joda.time.format.DateTimeFormat

object SendData extends App {
  Logger.getLogger("org").setLevel(Level.INFO)
  val oracleHost = scala.util.Properties.envOrElse("ORACLE_HOST", "<HOST_IP>")
  val oracleService = scala.util.Properties.envOrElse("ORACLE_SERVICE", "<SERVICE>")
  val oracleUser = scala.util.Properties.envOrElse("ORACLE_USER", "<USER>")
  val oraclePwd = scala.util.Properties.envOrElse("ORACLE_PWD", "<PASSWD>")
  val redisHost = scala.util.Properties.envOrElse("REDIS_HOST", "<REDIS_IP>")
  val redisPort = scala.util.Properties.envOrElse("REDIS_PORT", "6379")
  val oracleUrl = "jdbc:oracle:thin:@//" + oracleHost + "/" + oracleService
  val userSchema = new StructType().add("DTPROCESS", "string")
  val spark = SparkSession
    .builder()
    .appName("Send Data")
    .master("local[*]")
    .config("spark.redis.host", redisHost)
    .config("spark.redis.port", redisPort)
    .getOrCreate()
  // implicits needed for the Seq(...).toDF calls below
  import spark.implicits._
  val csvDF = spark.readStream.option("header", "true").schema(userSchema).csv("/tmp/checkpoint/*.csv")
  val output = csvDF
    .writeStream
    .outputMode("update")
    .foreachBatch { (df: DataFrame, batchId: Long) => {
      // df is the micro-batch from the CSV stream; it carries the date to query in Oracle
      val dtProcess = df.select(col("DTPROCESS")).first.getString(0).take(10)
      val query = s"""
        (SELECT 
            <FIELDS>
        FROM 
            TABLE
        WHERE
          DTPROCESS BETWEEN (TO_TIMESTAMP('$dtProcess 00:00:00.00', 'YYYY-MM-DD HH24:MI:SS.FF') + 1)
          AND (TO_TIMESTAMP('$dtProcess 23:59:59.99', 'YYYY-MM-DD HH24:MI:SS.FF') + 1)
        ) Table
      """
      // renamed from `df` to avoid shadowing the foreachBatch parameter above
      val oracleDF = spark.read
        .format("jdbc")
        .option("url", oracleUrl)
        .option("dbtable", query)
        .option("user", oracleUser)
        .option("password", oraclePwd)
        .option("driver", "oracle.jdbc.driver.OracleDriver")
        .load()
      oracleDF.cache()
      if (oracleDF.count() > 0) {
        oracleDF.write.format("org.apache.spark.sql.redis")
          .option("table", "process")
          .option("key.column", "PRIMARY_KEY")
          .mode(SaveMode.Append)
          .save()
      }
      if ((new DateTime(dtProcess).toLocalDate()).equals(new LocalDate()))
        Seq(dtProcess).toDF("DTPROCESS")
          .coalesce(1)
          .write.format("com.databricks.spark.csv")
          .mode("overwrite")
          .option("header", "true")
          .save("/tmp/checkpoint")
      else {
        val nextDay = new DateTime(dtProcess).plusDays(1)
        Seq(nextDay.toString(DateTimeFormat.forPattern("YYYY-MM-dd"))).toDF("DTPROCESS")
          .coalesce(1)
          .write.format("com.databricks.spark.csv")
          .mode("overwrite")
          .option("header", "true")
          .save("/tmp/checkpoint")
        }
      }}
    .start()
  output.awaitTermination()
}


** Reading from Stream **


// Imports needed by this snippet (log4j, Spark SQL, Spark type API)
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object ReceiveData extends App {
  Logger.getLogger("org").setLevel(Level.INFO)
  val mongoPwd = scala.util.Properties.envOrElse("MONGO_PWD", "bpedes")
  val redisHost = scala.util.Properties.envOrElse("REDIS_HOST", "<REDIS_IP>")
  val redisPort = scala.util.Properties.envOrElse("REDIS_PORT", "6379")
  val spark = SparkSession
    .builder()
    .appName("Receive Data")
    .master("local[*]")
    .config("spark.redis.host", redisHost)
    .config("spark.redis.port", redisPort)
    .getOrCreate()
  val processes = spark 
    .readStream
    .format("redis")
    .option("stream.keys", "process")
    .schema(StructType(Array(
      StructField("FIELD_1", StringType),
        StructField("PRIMARY_KEY", StringType),
      StructField("FIELD_3", TimestampType),
      StructField("FIELD_4", LongType),
      StructField("FIELD_5", StringType),
      StructField("FIELD_6", StringType),
      StructField("FIELD_7", StringType),
      StructField("FIELD_8", TimestampType)
    )))
    .load()
  val query = processes
    .writeStream
    .format("console")
    .start()
  query.awaitTermination()
}


This code writes the DataFrame to Redis as hashes, not to a Redis Stream:

oracleDF.write.format("org.apache.spark.sql.redis")
  .option("table", "process")
  .option("key.column", "PRIMARY_KEY")
  .mode(SaveMode.Append)
  .save()
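
As a result, the records end up as plain Redis hashes under the "process" table, so the streaming reader with option("stream.keys", "process") has nothing to consume: it only sees entries that were appended with XADD. One way to confirm where the data went is to read the hashes back with a batch (non-streaming) read. A minimal sketch, assuming the same spark-redis data source, table name and key column used by the writer, and reusing a SparkSession already configured with spark.redis.host / spark.redis.port:

// Batch read of the hashes written by the job above; "process" and "PRIMARY_KEY"
// are the table name and key column the writer used.
val hashes = spark.read
  .format("org.apache.spark.sql.redis")
  .option("table", "process")
  .option("key.column", "PRIMARY_KEY")
  .load()

hashes.show(5)   // if rows show up here, the data was written as hashes, not to a stream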

Spark-redis doesn't support writing to a Redis Stream out of the box.
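
If the goal is to keep the stream-based reader from the second snippet, one possible workaround is to append each row to the stream yourself with XADD from inside foreachBatch. The sketch below is an untested outline under these assumptions: a Jedis 3.x client is on the classpath, it replaces the oracleDF.write call in the writing job, and it reuses the redisHost and redisPort values defined there. Stream entries hold string field/value pairs, so every column is converted to a string.

import org.apache.spark.sql.Row
import redis.clients.jedis.{Jedis, StreamEntryID}
import scala.collection.JavaConverters._

// Inside foreachBatch, instead of oracleDF.write.format("org.apache.spark.sql.redis")...
val fields = oracleDF.schema.fieldNames
oracleDF.rdd.foreachPartition { rows: Iterator[Row] =>
  // one connection per partition; executors can't reuse a driver-side connection
  val jedis = new Jedis(redisHost, redisPort.toInt)
  try {
    rows.foreach { row =>
      // XADD takes a java.util.Map[String, String]; nulls become empty strings here
      val entry = fields.map { f =>
        f -> Option(row.getAs[Any](f)).map(_.toString).getOrElse("")
      }.toMap
      jedis.xadd("process", StreamEntryID.NEW_ENTRY, entry.asJava)
    }
  } finally {
    jedis.close()
  }
}

With entries actually appended to the "process" stream, the readStream snippet should start receiving data, as long as the schema column names there match the field names written by XADD.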
