I have a PySpark DataFrame, df1, that looks like:
CustomerID   CustomerValue   CustomerValue2
15           10              2
16           10              3
18           3               3
I have a second PySpark DataFrame, df2:
CustomerID   CustomerValue
15           2
16           3
18           4
I want to multiply all the columns of df1 (I have more than two columns) by the CustomerValue of df2, joining on CustomerID. So I want to get something like this:
CustomerID   CosineCustVal   CosineCustVal
15           20              4
16           30              9
18           12              9
Once you join them, you can run a for loop on the columns of df1:
from pyspark.sql import functions as F

# Rename df2's value column so it does not clash with df1's CustomerValue,
# then join on the column name so only one CustomerID column is kept
df2_renamed = df2.withColumnRenamed('CustomerValue', 'Multiplier')
df_joined = df1.join(df2_renamed, on='CustomerID')
# Multiply every value column of df1 by the value coming from df2
for col_name in df1.columns:
    if col_name != 'CustomerID':
        df_joined = df_joined.withColumn(col_name, F.col(col_name) * F.col('Multiplier'))
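As a quick check on the sample data (Multiplier is just the helper name introduced by the rename above; drop() and show() are standard DataFrame methods):

df_result = df_joined.drop('Multiplier')
df_result.show()

For customer 15 this gives 20 and 4, matching the expected output in the question; show() is the action that actually triggers the join and the multiplications.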
Based on the programming guide, Spark will still create a smart plan even though the for loop might suggest otherwise. Remember that Spark only starts the calculations once you call an action; until then you are just chaining transformations: https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations
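If you want to see that the loop only builds up a plan, you can inspect it before triggering any work (explain() is a standard DataFrame method; the exact plan text depends on your Spark version):

df_joined.explain()   # prints the query plan Spark has built so far; nothing has run yet
df_joined.count()     # an action such as count() or show() is what starts the job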