
Multiply two PySpark DataFrames

I have a PySpark DataFrame, df1, that looks like:

CustomerID  CustomerValue CustomerValue2 
15          10            2
16          10            3
18          3             3

I have a second PySpark DataFrame, df2:

 CustomerID  CustomerValue 
 15          2          
 16          3           
 18          4        

I want to multiply all the columns of df1 (it has more than two columns) by the CustomerValue column of df2, joining on CustomerID. So I want to get something like this:

 CustomerID     CosineCustVal CosineCustVal2
 15             20            4
 16             30            9
 18             12            12

Once you join them, you can run a for loop over the columns of df1. Note two pitfalls: both DataFrames have a CustomerValue column, so rename df2's copy before the join to avoid ambiguity, and skip CustomerID so the join key itself is not multiplied:

from pyspark.sql import functions as F

# Rename df2's value column so it does not collide with df1's CustomerValue.
df2_renamed = df2.withColumnRenamed('CustomerValue', 'Multiplier')

# Joining on the column name keeps a single CustomerID column in the result.
df_joined = df1.join(df2_renamed, on='CustomerID')

# Multiply every df1 column except the join key by df2's value.
for col_name in df1.columns:
    if col_name != 'CustomerID':
        df_joined = df_joined.withColumn(col_name, F.col(col_name) * F.col('Multiplier'))

result = df_joined.drop('Multiplier')

Spark will still create an efficient execution plan even though the for loop might suggest otherwise: Spark only starts computing once you call an action; until then you are merely composing transformations (see https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations).
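If you want to sanity-check the arithmetic without spinning up a Spark session, the same join-and-multiply pattern can be reproduced in pandas. This is only an illustration of the logic, not the PySpark answer itself; the DataFrames mirror the sample data from the question, and the column name `Multiplier` is my own choice to avoid the name collision:

```python
import pandas as pd

# Sample data mirroring df1 and df2 from the question.
df1 = pd.DataFrame({'CustomerID': [15, 16, 18],
                    'CustomerValue': [10, 10, 3],
                    'CustomerValue2': [2, 3, 3]})
df2 = pd.DataFrame({'CustomerID': [15, 16, 18],
                    'Multiplier': [2, 3, 4]})  # df2's CustomerValue, renamed

# Join on CustomerID, then multiply every df1 column except the key.
joined = df1.merge(df2, on='CustomerID')
for col in df1.columns:
    if col != 'CustomerID':
        joined[col] = joined[col] * joined['Multiplier']

result = joined.drop(columns='Multiplier')
print(result)
```

Row 18 comes out as 3 × 4 = 12 in both value columns, which is a quick way to verify the loop is multiplying the right columns.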
