
Loop through tables in a Databricks warehouse and extract certain values into another Delta table with PySpark

I have the following problem, which might be pretty easy to solve with intermediate PySpark skills.

I want to extract certain timestamps from certain tables in a Databricks warehouse and store them, with overwrite, in an existing Delta table of the "old timestamps". The challenge for me is to write the code generically enough that it can handle a varying number of tables, looping through the tables and extracting the timestamps - all in one fluent code snippet.

My first command should filter the relevant tables, so that I get only the tables which store the timestamps:

%sql SHOW TABLES FROM database1 LIKE 'date_stamp'
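
For the looping part, the same listing can also be obtained from PySpark and collected into a Python list. A minimal sketch, assuming the spark session that Databricks notebooks provide and the pattern from above:

tables_df = spark.sql("SHOW TABLES FROM database1 LIKE 'date_stamp'")
# SHOW TABLES returns the columns database, tableName and isTemporary;
# collect() brings the rows back to the driver as plain Python objects
table_names = [row.tableName for row in tables_df.collect()]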

After that, I want to look into every table of the result and collect the latest (max) timestamp:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.sql("SELECT timestamp FROM table_date_stamp_source1")
df_filtered=df.filter(df.timestamp.max)

The max timestamp of each table (i.e. source) should be stored in the dataframe for timestamps (here: final_df) and replace the old timestamp there:

from pyspark.sql.functions import when
final_df = final_df.withColumn("timestamp_max", when(final_df.source == "table_data_stamp_source1" , final_df.timestamp_max == df_filtered.timestamp) \
      .otherwise(final_df.timestamp_max))

This code does not execute properly, but it might give you an idea of what I want to do.
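
Put together, the flow I am aiming for looks roughly like this (a sketch, assuming every matching table has a timestamp column and that database1.final_table already exists as a Delta table; names not mentioned above are placeholders):

from pyspark.sql.functions import max as spark_max, when

# Tables holding the timestamps; pattern widened so several tables match
tables_df = spark.sql("SHOW TABLES FROM database1 LIKE '*date_stamp*'")
final_df = spark.table("database1.final_table")

for row in tables_df.collect():
    table_name = row.tableName
    # Latest timestamp of this source table
    max_ts = spark.table(f"database1.{table_name}").select(spark_max("timestamp")).head()[0]
    # Replace the stored timestamp for this source
    final_df = final_df.withColumn(
        "timestamp_max",
        when(final_df.source == table_name, max_ts).otherwise(final_df.timestamp_max),
    )

# Overwrite the existing Delta table with the updated timestamps
final_df.write.format("delta").mode("overwrite").saveAsTable("database1.final_table")

(Chaining one when per table keeps the sketch short; for many tables, a join against a small dataframe of (source, max_timestamp) pairs would scale better.)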

Thanks
Monty

Added on 21.12.22:

I have now added some iteration over the tables and want to integrate the filter code from the first answer, but I'm running into an error, apparently due to some formatting of my columns?!

df_relevant_Tables=sqlContext.sql("SHOW TABLES FROM db1 LIKE '*date*' ")
df_relevant_Tables.select(df_relevant_Tables.columns[1])
for index, row in df_relevant_Tables.iterrows():
    df_name = row
    ...
    latest_date=df.select(max("db1.{df_name}.timestamp_column"))

I then get the following error message:

[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `z` cannot be resolved. Did you mean one of the following? [`spark_catalog`.`db1`.`df_name`.`timestamp_column`];
'Project ['z]

How can I resolve it?
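
For reference, two things stand out in the snippet above: iterrows() exists only on pandas dataframes, not on Spark dataframes, and "db1.{df_name}.timestamp_column" is missing the f prefix, so {df_name} is never interpolated and the literal string is parsed as a column reference (the single-character column in the error also hints that max here is Python's built-in max applied to the string, rather than pyspark.sql.functions.max). A possible rewrite, as a sketch that keeps timestamp_column from the snippet:

from pyspark.sql.functions import max as spark_max

df_relevant_tables = spark.sql("SHOW TABLES FROM db1 LIKE '*date*'")

# A Spark dataframe has no iterrows(); collect() the rows instead
for row in df_relevant_tables.collect():
    df_name = row.tableName
    # f-string, so the table name is actually substituted into the identifier
    latest_date = spark.table(f"db1.{df_name}").select(spark_max("timestamp_column")).head()[0]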

As per your code, the below modifications may work for you. You currently have:

df_filtered=df.filter(df.timestamp.max)

Get the max timestamp from the dataframe like below.

from pyspark.sql.functions import max
max_timestamp = df.select(max('timestamp')).head()[0]

Then use this max_timestamp in the next code.

from pyspark.sql.functions import when

final_df = final_df.withColumn("timestamp_max", when(final_df.source == "table_data_stamp_source1", final_df.timestamp_max == df_filtered.timestamp).otherwise(final_df.timestamp_max))

In your when there is a condition, and after that there is another condition instead of a value.

when(condition, value) is the when syntax; after the condition, give the max_timestamp like below:

final_df = final_df.withColumn("timestamp_max", when(final_df.source == "table_data_stamp_source1", max_timestamp).otherwise(final_df.timestamp_max))

I have taken a sample dataframe like below, using id instead of timestamp.

[screenshot: sample dataframe]

This is the second dataframe, used for finding the highest id.

[screenshot: second dataframe]

Finding the highest id (the timestamp in your case) and replacing the id where firstname == 'Rakesh':

[screenshot: result after replacing the id]
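
Since the screenshots do not reproduce here, a rough reconstruction of the demo as code; the sample values are made up, and only firstname, id and 'Rakesh' come from the description above:

from pyspark.sql.functions import max, when

# Sample dataframe, with id standing in for the timestamp (values made up)
final_df = spark.createDataFrame([(1, "Rakesh"), (2, "Priya")], ["id", "firstname"])

# Second dataframe, only used to find the highest id
df = spark.createDataFrame([(7,), (9,), (4,)], ["id"])
max_id = df.select(max("id")).head()[0]

# Replace the id where firstname == 'Rakesh' with the highest id
final_df = final_df.withColumn("id", when(final_df.firstname == "Rakesh", max_id).otherwise(final_df.id))
final_df.show()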
