
Multiply two PySpark DataFrames

I have a PySpark DataFrame, df1, that looks like:

CustomerID  CustomerValue CustomerValue2 
15          10            2
16          10            3
18          3             3

I have a second PySpark DataFrame, df2:

 CustomerID  CustomerValue 
 15          2          
 16          3           
 18          4        

I want to multiply all the columns of df1 (it has more than two columns) by the CustomerValue column of df2, joining on CustomerID. So I want to get something like this:

 CustomerID     CosineCustVal CosineCustVal2
 15             20            4
 16             30            9
 18             12            12

Once you join them, you can run a for loop over the columns of df1. Note two pitfalls: both DataFrames have a CustomerValue column, so rename df2's copy before the join to avoid ambiguity, and skip CustomerID so the join key itself is not multiplied:

from pyspark.sql import functions as F

# Rename df2's value column so it does not collide with df1's CustomerValue.
df2_renamed = df2.withColumnRenamed('CustomerValue', 'Multiplier')

# Joining on the column name keeps a single CustomerID column in the result.
df_joined = df1.join(df2_renamed, on='CustomerID')

# Multiply every df1 column except the join key by df2's value.
for col_name in df1.columns:
    if col_name != 'CustomerID':
        df_joined = df_joined.withColumn(col_name, F.col(col_name) * F.col('Multiplier'))

result = df_joined.drop('Multiplier')

Spark will still create an efficient execution plan even though the for loop might suggest otherwise: Spark only starts computing once you call an action; until then you are merely composing transformations (see https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations).
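If you want to sanity-check the arithmetic without spinning up a Spark session, the same join-and-multiply pattern can be reproduced in pandas. This is only an illustration of the logic, not the PySpark answer itself; the DataFrames mirror the sample data from the question, and the column name `Multiplier` is my own choice to avoid the name collision:

```python
import pandas as pd

# Sample data mirroring df1 and df2 from the question.
df1 = pd.DataFrame({'CustomerID': [15, 16, 18],
                    'CustomerValue': [10, 10, 3],
                    'CustomerValue2': [2, 3, 3]})
df2 = pd.DataFrame({'CustomerID': [15, 16, 18],
                    'Multiplier': [2, 3, 4]})  # df2's CustomerValue, renamed

# Join on CustomerID, then multiply every df1 column except the key.
joined = df1.merge(df2, on='CustomerID')
for col in df1.columns:
    if col != 'CustomerID':
        joined[col] = joined[col] * joined['Multiplier']

result = joined.drop(columns='Multiplier')
print(result)
```

Row 18 comes out as 3 × 4 = 12 in both value columns, which is a quick way to verify the loop is multiplying the right columns.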
