
How to update a few records in Spark

I have the following Scala program for Spark:

val dfA = sqlContext.sql("select * from employees where id in ('Emp1', 'Emp2')" )
val dfB = sqlContext.sql("select * from employees where id not in ('Emp1', 'Emp2')" )
val dfN = dfA.withColumn("department", lit("Finance"))
val dfFinal = dfN.unionAll(dfB)
dfFinal.registerTempTable("intermediate_result")

dfA.unpersist
dfB.unpersist
dfN.unpersist
dfFinal.unpersist

val dfTmp = sqlContext.sql("select * from intermediate_result")
dfTmp.write.mode("overwrite").format("parquet").saveAsTable("employees")
dfTmp.unpersist

When I try to save it, I get the following error:

org.apache.spark.sql.AnalysisException: Cannot overwrite table employees that is also being read from.;
    at org.apache.spark.sql.execution.datasources.PreWriteCheck.failAnalysis(rules.scala:106)
    at org.apache.spark.sql.execution.datasources.PreWriteCheck$$anonfun$apply$3.apply(rules.scala:182)
    at org.apache.spark.sql.execution.datasources.PreWriteCheck$$anonfun$apply$3.apply(rules.scala:109)
    at org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:111)
    at org.apache.spark.sql.execution.datasources.PreWriteCheck.apply(rules.scala:109)
    at org.apache.spark.sql.execution.datasources.PreWriteCheck.apply(rules.scala:105)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$2.apply(CheckAnalysis.scala:218)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$2.apply(CheckAnalysis.scala:218)
    at scala.collection.immutable.List.foreach(List.scala:318)

My questions are:

  1. Is my approach correct for changing the department of two employees?
  2. Why am I getting this error when I have released the DataFrames?

Is my approach correct to change the department of two employees?

It is not. Just to repeat something that has been said multiple times on Stack Overflow: Apache Spark is not a database. It is not designed for fine-grained updates. If your project requires operations like this, use one of the many databases available on Hadoop.

Why am I getting this error when I have released the DataFrames?

Because you didn't. All you've done is add a name to the execution plan. Checkpointing would be the closest thing to "releasing", but you really don't want to end up in a situation where you lose an executor in the middle of a destructive operation.
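For reference only (the caveat above still applies), here is a minimal sketch of what checkpointing looks like. On Spark 1.x only RDDs can be checkpointed, so the DataFrame is rebuilt from the checkpointed RDD; the checkpoint directory is an assumption:

sqlContext.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")  // assumed path

val rdd = dfFinal.rdd
rdd.checkpoint()   // lineage is truncated once the RDD is materialised
rdd.count()        // force materialisation of the checkpoint

val dfCheckpointed = sqlContext.createDataFrame(rdd, dfFinal.schema)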

You could write to a temporary directory, delete the input and move the temporary files, but really - just use a tool which is fit for the job.
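If you do go down that road, a rough sketch of the workaround could look like this; the paths and the Parquet format are assumptions for illustration:

import org.apache.hadoop.fs.{FileSystem, Path}

val tmpDir   = new Path("/tmp/employees_staging")          // illustrative paths only
val tableDir = new Path("/user/hive/warehouse/employees")

// 1. Write the result to a temporary directory.
dfFinal.write.mode("overwrite").parquet(tmpDir.toString)

// 2. Delete the original input and move the temporary files into place.
val fs = FileSystem.get(sqlContext.sparkContext.hadoopConfiguration)
fs.delete(tableDir, true)
fs.rename(tmpDir, tableDir)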

The following is an approach you can try.

Instead of using the registerTempTable API, you can write it into another table using the saveAsTable API:

dfFinal.write.mode("overwrite").saveAsTable("intermediate_result")

Then, write it into the employees table:

val dy = sqlContext.table("intermediate_result")
dy.write.mode("overwrite").insertInto("employees")

Finally, drop the intermediate_result table.
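For example (assuming the intermediate table is registered in the same metastore):

sqlContext.sql("DROP TABLE IF EXISTS intermediate_result")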

I would approach it this way:

>>> df = sqlContext.sql("select * from t")
>>> df.show()
+-------------+---------------+
|department_id|department_name|
+-------------+---------------+
|            2|        Fitness|
|            3|       Footwear|
|            4|        Apparel|
|            5|           Golf|
|            6|       Outdoors|
|            7|       Fan Shop|
+-------------+---------------+

To mimic your flow, I create two DataFrames, do a union, and write back to the same table t (deliberately removing department_id = 4 in this example):

>>> df1 = sqlContext.sql("select * from t where department_id < 4")
>>> df2 = sqlContext.sql("select * from t where department_id > 4")
>>> df3 = df1.unionAll(df2)
>>> df3.registerTempTable("df3")
>>> sqlContext.sql("insert overwrite table t select * from df3")
DataFrame[]  
>>> sqlContext.sql("select * from t").show()
+-------------+---------------+
|department_id|department_name|
+-------------+---------------+
|            2|        Fitness|
|            3|       Footwear|
|            5|           Golf|
|            6|       Outdoors|
|            7|       Fan Shop|
+-------------+---------------+

  1. Extract the RDD and schema from the DataFrame.

  2. Create a new clone DataFrame.

  3. Overwrite the table.

private def overWrite(df: DataFrame): Unit = {
  // 1. Extract the RDD and schema from the incoming DataFrame.
  val schema = df.schema
  val rdd = df.rdd

  // 2. Create a new clone DataFrame, detached from the original plan.
  val dfForSave = spark.createDataFrame(rdd, schema)

  // 3. Overwrite the table (spark and tableSource come from the enclosing class).
  dfForSave.write
    .mode(SaveMode.Overwrite)
    .insertInto(s"${tableSource.schema}.${tableSource.table}")
}

Let's say it is a Hive table you are reading and overwriting.

Please introduce a timestamp into the Hive table location as follows:

create table table_name (
  id                int,
  dtDontQuery       string,
  name              string
)
location 'hdfs://user/table_name/timestamp'

As overwriting in place is not possible, we will write the output file to a new location.

Write the data to that new location using the DataFrame API:

df.write.orc("hdfs://user/xx/tablename/newtimestamp/")

Once the data is written, alter the Hive table location to the new location:

alter table tablename set location 'hdfs://user/xx/tablename/newtimestamp/'
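Putting the steps together, a hedged end-to-end sketch; the table name, paths, and issuing the DDL through sqlContext.sql are assumptions:

// Generate a fresh timestamped directory for each overwrite (illustrative only).
val newLocation = s"hdfs://user/xx/tablename/${System.currentTimeMillis}/"

// 1. Write the output to the new location.
df.write.orc(newLocation)

// 2. Repoint the Hive table at the new location.
sqlContext.sql(s"alter table tablename set location '$newLocation'")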
