loop through tables in databricks warehouse and extract certain values into another delta table with pyspark

I have the following problem, which might be pretty easy to solve with intermediate pyspark skills.

I want to extract certain timestamps from certain tables in a Databricks warehouse and store them, with overwrite, into an existing delta table of the "old timestamps". The challenge for me is to write the code generically enough that it can handle a varying number of tables, loop through the tables, and extract the timestamps - all in one fluent code snippet.

My first command should filter the relevant tables, so that I get only the tables which store the time stamps:

%sql SHOW TABLES FROM database1 LIKE 'date_stamp'
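If you need the same lookup from Python rather than a %sql cell, a minimal sketch could look like the following (assumptions: the database is named database1, a wildcard pattern is used since SHOW TABLES ... LIKE otherwise matches literally, and spark is the SparkSession Databricks provides in every notebook):

# run SHOW TABLES from PySpark and collect the matching table names
tables_df = spark.sql("SHOW TABLES FROM database1 LIKE '*date_stamp*'")
# the result contains a tableName column, among others
table_names = [row.tableName for row in tables_df.collect()]
print(table_names)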

After that I want to look into every table of the result and collect the latest (max) timestamp:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.sql("SELECT timestamp FROM table_date_stamp_source1")
df_filtered=df.filter(df.timestamp.max)

Every max timestamp for a certain table (i.e. source) should be stored in the dataframe of time stamps (here: final_df) and replace the old time stamp there:

from pyspark.sql.functions import when
final_df = final_df.withColumn("timestamp_max", when(final_df.source == "table_data_stamp_source1" , final_df.timestamp_max == df_filtered.timestamp) \
      .otherwise(final_df.timestamp_max))

This code does not execute properly, but it might give you an idea of what I want to do.

Thanks
Monty

Adding on 21.12.22:

I have now added some iteration over the tables and want to integrate the filter code from the first answer, but I'm running into an error due to some formatting of my columns?!

df_relevant_Tables=sqlContext.sql("SHOW TABLES FROM db1 LIKE '*date*' ")
df_relevant_Tables.select(df_relevant_Tables.columns[1])
for index, row in df_relevant_Tables.iterrows():
    df_name = row
    ...
    latest_date = df.select(max("db1.{df_name}.timestamp_column"))

I then get the following error message:

[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `z` cannot be resolved. Did you mean one of the following? [`spark_catalog`.`db1`.`df_name`.`timestamp_column`];
'Project ['z]

How can I resolve it?
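A note on this edit: iterrows() is a pandas API that Spark DataFrames do not have, and "db1.{df_name}.timestamp_column" is a plain string rather than an f-string, so {df_name} is never substituted (the column z in the error comes from code not shown above). A minimal sketch of the iteration under those assumptions:

from pyspark.sql.functions import max as max_

df_relevant_tables = spark.sql("SHOW TABLES FROM db1 LIKE '*date*'")

# collect() brings the small result set to the driver so plain Python can loop over it
for row in df_relevant_tables.collect():
    df_name = row.tableName
    # f-string, so the table name is actually substituted
    df = spark.table(f"db1.{df_name}")
    latest_date = df.select(max_("timestamp_column")).head()[0]
    print(df_name, latest_date)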

As per your code, the modifications below may work for you. You currently have:

df_filtered=df.filter(df.timestamp.max)

Get the max timestamp from the dataframe like below:

from pyspark.sql.functions import max  # pyspark's max, not Python's builtin
max_timestamp = df.select(max('timestamp')).head()[0]

Then use this max_timestamp in the next code.

from pyspark.sql.functions import when
final_df = final_df.withColumn("timestamp_max", when(final_df.source == "table_data_stamp_source1", final_df.timestamp_max == df_filtered.timestamp).otherwise(final_df.timestamp_max))

In your when there is a condition, and after it there is another condition instead of a value.

when(_condition_, _value_) is the when syntax; after the condition, give the max_timestamp like below.

final_df = final_df.withColumn("timestamp_max", when(final_df.source == "table_data_stamp_source1" , max_timestamp).otherwise(final_df.timestamp_max))
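Tying these pieces together with the table loop from the question, a minimal end-to-end sketch; the database name, the timestamp column name, and the name of the delta table holding final_df are assumptions:

from pyspark.sql.functions import col, max as max_, when

# assumed: database1, tables matching '*date_stamp*' with a 'timestamp' column, and an
# existing delta table database1.final_timestamps with source / timestamp_max columns
tables = [row.tableName
          for row in spark.sql("SHOW TABLES FROM database1 LIKE '*date_stamp*'").collect()]

final_df = spark.table("database1.final_timestamps")

for table_name in tables:
    max_timestamp = spark.table(f"database1.{table_name}").select(max_("timestamp")).head()[0]
    final_df = final_df.withColumn(
        "timestamp_max",
        when(col("source") == table_name, max_timestamp).otherwise(col("timestamp_max")),
    )

# delta's snapshot isolation lets you overwrite the table that was read above
final_df.write.format("delta").mode("overwrite").saveAsTable("database1.final_timestamps")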

I have taken a sample dataframe like below, using id instead of timestamp.

[image: sample dataframe with an id column]

This is the second dataframe, used for finding the highest id.

[image: second dataframe used to find the highest id]

Finding the highest id (the timestamp in your case) and replacing the id where firstname == 'Rakesh':

[image: result after replacing the id where firstname == 'Rakesh']
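Since the screenshots do not carry over here, a minimal sketch of the same example with assumed sample data:

from pyspark.sql.functions import max as max_, when

# first dataframe: the one whose id should be replaced
final_df = spark.createDataFrame([(1, "Rakesh"), (2, "John")], ["id", "firstname"])

# second dataframe: only used to find the highest id
df = spark.createDataFrame([(5,), (9,), (7,)], ["id"])

max_id = df.select(max_("id")).head()[0]  # 9

# replace the id where firstname == 'Rakesh'
final_df = final_df.withColumn("id", when(final_df.firstname == "Rakesh", max_id).otherwise(final_df.id))
final_df.show()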
