Iterate over tables in a Databricks warehouse and extract certain values into another Delta table with PySpark

Question

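I have the following problem, which is probably easy to solve with intermediate PySpark skills.

I want to extract certain timestamps from some tables in a Databricks warehouse and store them, overwriting the old values, in an existing Delta table of "old timestamps". My challenge is writing the code generically enough that it can handle a varying number of tables, loop over them, and extract the timestamps, all in one smooth snippet.

My first command should filter for the relevant tables, since I only want the tables that store timestamps.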

%sql SHOW TABLES FROM database1 LIKE 'date_stamp'

After that I want to look at each table in the result and collect the latest (max) timestamp.

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.sql("SELECT timestamp FROM table_date_stamp_source1")
df_filtered=df.filter(df.timestamp.max)

Each max timestamp of a given table (i.e. source) should be stored in the dataframe of timestamps (here: final_df), replacing the old timestamp.

from pyspark.sql.functions import when
final_df = final_df.withColumn("timestamp_max", when(final_df.source == "table_data_stamp_source1" , final_df.timestamp_max == df_filtered.timestamp) \
      .otherwise(final_df.timestamp_max))

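This code does not execute properly, but it might give you an idea of what I want to do.

Thanks
Monty

Added on 21.12.22

I have now added some iteration over the tables and wanted to integrate the filter code from the first answer, but I run into an error, apparently because of the format of my column.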

df_relevant_Tables=sqlContext.sql("SHOW TABLES FROM db1 LIKE '*date*' ")
df_relevant_Tables.select(df_relevant_Tables.columns[1])
for index, row in df_relevant_Tables.iterrows():
    df_name = row
    ...
    latest_date=df.select(max("db1.{df_name}.timestamp_column"))

Then I get the following error message:

[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `z` cannot be resolved. Did you mean one of the following? [`spark_catalog`.`db1`.`df_name`.`timestamp_column`];
'Project ['z]

How can I fix this?

Answer 1

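Based on your code above, the following modifications might work for you.

This line does not give you the maximum:

df_filtered=df.filter(df.timestamp.max)

Get the maximum timestamp from the dataframe as shown below instead.

from pyspark.sql.functions import max  # note: shadows Python's built-in max
max_timestamp=df.select(max('timestamp')).head()[0]

Then use this max_timestamp in the next code.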

from pyspark.sql.functions import when
final_df = final_df.withColumn("timestamp_max", when(final_df.source == "table_data_stamp_source1", final_df.timestamp_max ==df_filtered.timestamp).otherwise(final_df.timestamp_max))

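In your when, the condition is followed by a second condition (final_df.timestamp_max == df_filtered.timestamp) where a value is expected.

The syntax is when(condition, value), so pass max_timestamp after the condition, as shown below.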

final_df = final_df.withColumn("timestamp_max", when(final_df.source == "table_data_stamp_source1" , max_timestamp).otherwise(final_df.timestamp_max))
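
To cover a varying number of tables rather than one hard-coded source, the same max lookup can be run once per table returned by SHOW TABLES. The following is only a minimal sketch under some assumptions: a SparkSession is passed in as spark, the SHOW TABLES result has a tableName column, every matching table has the named timestamp column, and the database and table names are trusted (they are interpolated into SQL). The function names max_timestamp_query and collect_max_timestamps are made up for illustration. Two details from the 21.12.22 addition matter here: a Spark DataFrame has no iterrows() (that is a pandas method), so the rows are collected first, and "db1.{df_name}.timestamp_column" needs an f prefix, otherwise {df_name} is passed through literally.

```python
def max_timestamp_query(database, table, column="timestamp"):
    """Build the query that returns the latest timestamp of one table."""
    return f"SELECT MAX(`{column}`) AS timestamp_max FROM `{database}`.`{table}`"


def collect_max_timestamps(spark, database, pattern, column="timestamp"):
    """Return {table_name: latest timestamp} for every table matching pattern.

    `spark` is assumed to be an active SparkSession (hypothetical parameter).
    """
    show_sql = f"SHOW TABLES FROM `{database}` LIKE '{pattern}'"
    latest = {}
    # collect() brings the (small) table listing to the driver as Row objects
    for row in spark.sql(show_sql).collect():
        table = row["tableName"]
        # head()[0] extracts the single aggregated value from the result
        latest[table] = spark.sql(
            max_timestamp_query(database, table, column)
        ).head()[0]
    return latest


# Possible usage in a notebook where `spark` already exists:
# latest = collect_max_timestamps(spark, "db1", "*date*", column="timestamp_column")
```

Each entry of the returned dict can then be used as the value in when(final_df.source == table_name, value), in the same way as max_timestamp above.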

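I tried this with a sample dataframe, using id in place of timestamp.

[image: sample dataframe]

This is the second dataframe, used to find the highest id.

[image: second dataframe]

Find the highest id (the timestamp in your case) and replace the id where firstname=='Rakesh'.

[image: resulting dataframe]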
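
Since the question asks for the values to end up in an existing Delta table, one alternative to the when/otherwise update is to let Delta Lake do the replacement with a single MERGE INTO statement. This is a different technique from the one shown above, and only a sketch: the target table name old_timestamps and the helper name build_merge_statement are hypothetical, the target is assumed to have the columns source and timestamp_max (as final_df does in the question), and table names are assumed to be trusted because they are interpolated into SQL.

```python
def build_merge_statement(target, database, tables, column="timestamp"):
    """Build one MERGE INTO statement that refreshes timestamp_max per source.

    `target` is the existing Delta table (hypothetical name in the usage below);
    `tables` is the list of source table names inside `database`.
    """
    if not tables:
        raise ValueError("at least one source table is required")
    # One SELECT per source table, each yielding (source, timestamp_max)
    selects = [
        f"SELECT '{t}' AS source, MAX(`{column}`) AS timestamp_max "
        f"FROM `{database}`.`{t}`"
        for t in tables
    ]
    source_sql = "\nUNION ALL\n".join(selects)
    return (
        f"MERGE INTO {target} AS tgt\n"
        f"USING (\n{source_sql}\n) AS src\n"
        "ON tgt.source = src.source\n"
        "WHEN MATCHED THEN UPDATE SET tgt.timestamp_max = src.timestamp_max\n"
        "WHEN NOT MATCHED THEN INSERT (source, timestamp_max) "
        "VALUES (src.source, src.timestamp_max)"
    )


# Possible usage in a notebook where `spark` already exists:
# spark.sql(build_merge_statement("old_timestamps", "db1",
#                                 ["table_date_stamp_source1"]))
```

Building one statement for all sources keeps the write to the Delta table in a single operation, whatever the number of tables.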

Solution 1
0 2022-12-13 08:23:55
