
Spark Streaming Data-frame Persist Operation

I am reading an Oracle database from my Spark code and persisting it (a cache operation):

  val dataOracle = spark.read
    .format("jdbc")
    .option("url", conn_url)
    // alias the subquery (required by some databases); the stray s-interpolator was unused
    .option("dbtable", "(select * from table) t")
    .option("user", oracle_user)
    .option("password", oracle_pass)
    .option("driver", oracle_driver)
    .load()
    .persist()

At the end of the code I need to unpersist this dataframe, because the database may change and I need the fresh data in the next cycle; at the same time, the time cost matters a lot to me. If I cache the dataframe my code takes under 1 second; if I don't, it takes over 3 seconds (which is not acceptable). Is there any strategy to get the latest data from the DB while also minimizing the time cost?
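A minimal sketch of the per-cycle refresh described above, which makes the full reload per cycle explicit; readOracle() and cycles are placeholder names, not from the original question:

  import org.apache.spark.sql.DataFrame

  // Hypothetical wrapper around the JDBC read shown above.
  def readOracle(): DataFrame =
    spark.read.format("jdbc")
      .option("url", conn_url)
      .option("dbtable", "(select * from table) t")
      .option("user", oracle_user)
      .option("password", oracle_pass)
      .option("driver", oracle_driver)
      .load()

  var dataOracle = readOracle().persist()
  for (_ <- 1 to cycles) {              // one iteration per processing cycle
    // ... run the count queries shown below against dataOracle ...
    dataOracle.unpersist()              // drop the stale cached copy
    dataOracle = readOracle().persist() // full re-read: next cycle sees DB changes
  }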

Here is my main operation using the Oracle data:

dataOracle.createOrReplaceTempView("TABLE")
// count all rows for this name, and the subset that also matches the given index
val total = spark.sql(s"SELECT count(*) FROM TABLE WHERE name = ${name}").first().getLong(0)
val items = spark.sql(s"SELECT count(*) FROM TABLE WHERE index = ${id} AND name = ${name}").first().getLong(0)
val first_rule: Double = total.toDouble / items.toDouble

If your dataframe is updated and you need those updates, then by definition you can't cache anything; you have to read it all over again. A possible optimization is to add a last-modified timestamp column to the table in the database and only read the entries whose timestamp is greater than some value (e.g. the time of your previous read).
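A sketch of that incremental approach, assuming a LAST_MODIFIED column has been added to the table; the column name, the lastRead watermark, and the merge step are illustrative assumptions, not part of the original answer:

  import java.sql.Timestamp

  // Watermark: everything up to this instant has already been read.
  var lastRead = new Timestamp(0L)

  // Push the filter into the JDBC subquery so Oracle returns only rows
  // changed since the previous cycle (LAST_MODIFIED is an assumed column).
  val deltaQuery =
    s"(select * from table where LAST_MODIFIED > " +
    s"TO_TIMESTAMP('$lastRead', 'YYYY-MM-DD HH24:MI:SS.FF')) t"

  val delta = spark.read
    .format("jdbc")
    .option("url", conn_url)
    .option("dbtable", deltaQuery)
    .option("user", oracle_user)
    .option("password", oracle_pass)
    .option("driver", oracle_driver)
    .load()

  // Naive merge: fine for append-only tables. If rows can be updated in
  // place, their old versions would have to be dropped before the union.
  val refreshed = dataOracle.unionByName(delta).persist()
  refreshed.count()                   // materialize before dropping the old cache
  dataOracle.unpersist()
  lastRead = new Timestamp(System.currentTimeMillis())

Whether this beats a full re-read depends on how many rows change per cycle and on Oracle answering the LAST_MODIFIED filter quickly (an index on that column helps).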

