
Spark Streaming Data-frame Persist Operation

I am reading an Oracle database from my Spark code and persisting it (a cache operation):

  val dataOracle = spark.read
    .format("jdbc")
    .option("url", conn_url)
    // alias the subquery (required by some databases); the stray s-interpolator was unused
    .option("dbtable", "(select * from table) t")
    .option("user", oracle_user)
    .option("password", oracle_pass)
    .option("driver", oracle_driver)
    .load()
    .persist()

At the end of the code I need to unpersist this dataframe, because the database may change and I need the fresh data in the next cycle; at the same time, the time cost matters a lot to me. If I cache the dataframe my code takes under 1 second; if I don't, it takes over 3 seconds (which is not acceptable). Is there any strategy to get the latest data from the DB while also minimizing the time cost?
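A minimal sketch of the per-cycle refresh described above, which makes the full reload per cycle explicit; readOracle() and cycles are placeholder names, not from the original question:

  import org.apache.spark.sql.DataFrame

  // Hypothetical wrapper around the JDBC read shown above.
  def readOracle(): DataFrame =
    spark.read.format("jdbc")
      .option("url", conn_url)
      .option("dbtable", "(select * from table) t")
      .option("user", oracle_user)
      .option("password", oracle_pass)
      .option("driver", oracle_driver)
      .load()

  var dataOracle = readOracle().persist()
  for (_ <- 1 to cycles) {              // one iteration per processing cycle
    // ... run the count queries shown below against dataOracle ...
    dataOracle.unpersist()              // drop the stale cached copy
    dataOracle = readOracle().persist() // full re-read: next cycle sees DB changes
  }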

Here is my main operation using the Oracle data:

dataOracle.createOrReplaceTempView("TABLE")
// count all rows for this name, and the subset that also matches the given index
val total = spark.sql(s"SELECT count(*) FROM TABLE WHERE name = ${name}").first().getLong(0)
val items = spark.sql(s"SELECT count(*) FROM TABLE WHERE index = ${id} AND name = ${name}").first().getLong(0)
val first_rule: Double = total.toDouble / items.toDouble

If your dataframe is updated and you need those updates, then by definition you can't cache anything; you have to read it all over again. A possible optimization is to add a last-modified timestamp column to the table in the database and only read the entries whose timestamp is greater than some value (e.g. the time of your previous read).
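A sketch of that incremental approach, assuming a LAST_MODIFIED column has been added to the table; the column name, the lastRead watermark, and the merge step are illustrative assumptions, not part of the original answer:

  import java.sql.Timestamp

  // Watermark: everything up to this instant has already been read.
  var lastRead = new Timestamp(0L)

  // Push the filter into the JDBC subquery so Oracle returns only rows
  // changed since the previous cycle (LAST_MODIFIED is an assumed column).
  val deltaQuery =
    s"(select * from table where LAST_MODIFIED > " +
    s"TO_TIMESTAMP('$lastRead', 'YYYY-MM-DD HH24:MI:SS.FF')) t"

  val delta = spark.read
    .format("jdbc")
    .option("url", conn_url)
    .option("dbtable", deltaQuery)
    .option("user", oracle_user)
    .option("password", oracle_pass)
    .option("driver", oracle_driver)
    .load()

  // Naive merge: fine for append-only tables. If rows can be updated in
  // place, their old versions would have to be dropped before the union.
  val refreshed = dataOracle.unionByName(delta).persist()
  refreshed.count()                   // materialize before dropping the old cache
  dataOracle.unpersist()
  lastRead = new Timestamp(System.currentTimeMillis())

Whether this beats a full re-read depends on how many rows change per cycle and on Oracle answering the LAST_MODIFIED filter quickly (an index on that column helps).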

