How do I subtract all column values of two PySpark DataFrames?

Question

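Hi, I've run into a situation where I need to subtract every column value of one PySpark DataFrame from the corresponding value in another, like this:

df1: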

col1 col2 ... col100
 1    2   ...  100

df2:

col1 col2 ... col100
5     4   ...  20

I want df1 - df2 to give the final DataFrame, new df:

col1 col2  ... col100
-4     -2  ...   80

The solution I found subtracts one column at a time, for example:

new_df = df1.withColumn('col1', df1['col1'] - df2['col1'])

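But I have 101 columns. How can I simply loop over all of them and avoid writing the same logic 101 times? Any answer would be much appreciated!

With 101 columns, how can I simply iterate over all the columns and subtract their values?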

Answer 1

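You can write a for loop that iterates over the columns and overwrites each one with the difference. A column from df2 cannot be referenced directly inside df1.withColumn, so join the two DataFrames first. Here is one way to do it in PySpark:

columns = df1.columns

# Rename df2's columns with a '_2' suffix, then join so both sets are in scope.
# crossJoin assumes each DataFrame holds a single row, as in the question.
new_df = df1.crossJoin(df2.toDF(*[c + '_2' for c in df2.columns]))

for c in columns:
    new_df = new_df.withColumn(c, new_df[c] - new_df[c + '_2']).drop(c + '_2')

This produces a new DataFrame containing the difference for each column.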

Answer 2

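In a single select with a Python list comprehension:

columns = df1.columns

# Cross join so both DataFrames' columns are in scope (assumes one row each).
df1 = df1.crossJoin(df2).select(*[(df1[c] - df2[c]).alias(c) for c in columns])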