
Alter multiple column comments simultaneously in spark/delta lake

Short version: I need a faster/better way to update many column comments at once in spark/databricks. I have a pyspark notebook that can do this sequentially across many tables, but if I call it from multiple tasks they spend so long waiting on a hive connection that I get timeout failures.

Command used: ALTER TABLE my_db_name.my_table_name CHANGE my_column COMMENT "new comment" (docs)

Long version: I have a data dictionary notebook where I maintain column descriptions that are reused across multiple tables. If I run the notebook directly, it successfully populates all of my database's table and column comments by issuing the above command sequentially for every column across all tables (plus the corresponding table-description command once per table).
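
For reference, the sequential version boils down to a loop like the following. This is only a minimal sketch: the column_comments dict and the table name are illustrative placeholders, not the actual notebook code.

    # Sketch: issue one ALTER per column, sequentially on the driver.
    # column_comments is a hypothetical dict of {column_name: comment}.
    for column, comment in column_comments.items():
        spark.sql(
            f'ALTER TABLE my_db_name.my_table_name CHANGE {column} COMMENT "{comment}"'
        )

Each statement is a metastore round trip, so this is slow across many tables, but it never conflicts with itself.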

I'm trying to move this to a per-table call. In the databricks tasks that populate the tables, I check whether the output table exists. If not, it's created, and at the end I call the dictionary notebook (using dbutils.notebook.run("Data Dictionary Creation", 600, {"db": output_db, "update_table": output_table})) to populate the comments for that particular table. If this happens simultaneously for multiple tables, however, the notebook calls now time out, as most of the tasks spend a lot of time waiting for client connection with hive. This happens even though there's only one call of the notebook per table.
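
For context, the per-table task ends with something roughly like this (a sketch only; the existence check via spark.catalog.tableExists is my paraphrase, not necessarily the exact check used):

    # Create the output table if it doesn't exist yet, then populate its comments.
    if not spark.catalog.tableExists(f"{output_db}.{output_table}"):
        # ... create the output table here ...
        pass

    # Call the shared dictionary notebook for just this table (600 s timeout).
    dbutils.notebook.run(
        "Data Dictionary Creation",
        600,
        {"db": output_db, "update_table": output_table},
    )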

Solution Attempts:

  1. I tried many variations of the above command to update all column comments in one call per table, but it's either impossible or my syntax is wrong.
  2. It's unclear to me how to avoid the timeout issues (I've doubled the timeout to 10 minutes and it still fails, even though the original notebook takes much less time than that to run across all tables!). I need to wait for completion before continuing to the next task (otherwise I'd just spawn it as a background process).

Update: I think what's happening here is that the above ALTER command is being called in a loop, and when I schedule a job that loop gets distributed and called in parallel. What I may actually need is a way to call it, or a function containing it, without letting the loop be distributed. Is there a way to force sequential execution for a single function?

In the end I found a solution for this issue.

First, the problem seems to have been that the loop with the ALTER command was getting parallelized by spark, and thus firing multiple (conflicting) commands simultaneously on the same table.

The answer to this was two-fold:

  1. Add a .coalesce(1) to the end of the function I was calling with the ALTER line. This limits the function to sequential execution.
  2. Return a newly-created empty dataframe from the function to avoid coalesce-based errors.

Part 2 seems to have been necessary because this command is, I think, meant to get a result back for aggregation. I couldn't find a way to make it work without that (.repartition(1) had the same issue), so in the end I returned spark.createDataFrame([(1, "foo")], ["id", "label"]) from the function, and things then worked.
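
Putting the two parts together, the relevant piece now looks roughly like this. It's a sketch of the workaround rather than the verbatim notebook code; the function name and the column_comments dict are placeholders.

    def update_column_comments(db, table, column_comments):
        # Issue the ALTER statements one by one.
        for column, comment in column_comments.items():
            spark.sql(
                f'ALTER TABLE {db}.{table} CHANGE {column} COMMENT "{comment}"'
            )
        # Return a small dummy dataframe so the trailing .coalesce(1) has
        # something to operate on (part 2 above).
        return spark.createDataFrame([(1, "foo")], ["id", "label"])

    # The .coalesce(1) is what keeps the calls from being fanned out in parallel.
    update_column_comments(output_db, output_table, column_comments).coalesce(1)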

This gets me to my desired end goal of working through all the alter commands without conflict errors.

It's clunky as hell though; I'd still love improvements or alternative approaches if anyone has one.

If you want to change multiple columns at once, why not recreate the table? (This trick will only work if table 'B' is an external table. Here table 'B' is the 'B'ad table with outdated comments, and table 'A' is the good table with good comments.)

  1. Drop table 'B'.
  2. Create the table with the required comments ('A').

If this table is NOT external, then you might want to create a view and start using that. This would enable you to add updated comments without altering the original table's data.
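
A sketch of that view-based approach (the view name, columns, and comments are placeholders):

    # Recreate a view over the managed table with fresh column comments;
    # the underlying table data is untouched.
    spark.sql("""
        CREATE OR REPLACE VIEW my_db_name.my_table_commented (
            id   COMMENT 'surrogate key',
            name COMMENT 'customer display name'
        ) AS SELECT id, name FROM my_db_name.my_table_name
    """)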

Have you considered using table properties instead of comments?
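
Many properties can be set in a single statement, which avoids the one-ALTER-per-column round trips. The key names below are just an illustrative convention, not a standard:

    # Store descriptions as table properties in one metadata call.
    spark.sql("""
        ALTER TABLE my_db_name.my_table_name SET TBLPROPERTIES (
            'comment.my_column'    = 'new comment',
            'comment.other_column' = 'another comment'
        )
    """)

The trade-off is that these don't show up as column comments in DESCRIBE output or the catalog UI.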
