
Add a list of values to each row in a PySpark dataframe

I have a PySpark dataframe, which we'll assume looks like this:

df = spark.createDataFrame([(10,11,12,13), (20,21,22,23), (30,31,32,33)],['var1', 'var2', 'var3', 'var4'])
df.show()
>
|---------------------|------------------|---------------------|------------------|
|         var1        |        var2      |         var3        |       var4       |
|---------------------|------------------|---------------------|------------------|
|          10         |         11       |          12         |         13       |
|---------------------|------------------|---------------------|------------------|
|          20         |         21       |          22         |         23       |
|---------------------|------------------|---------------------|------------------|
|          30         |         31       |          32         |         33       |
|---------------------|------------------|---------------------|------------------|

I am trying to subtract a list of values from each of these rows. Let's say the list is [1,1,2,2]. The expected result from this operation is:

|---------------------|------------------|---------------------|------------------|
|         var1        |        var2      |         var3        |       var4       |
|---------------------|------------------|---------------------|------------------|
|          9          |         10       |          10         |         11       |
|---------------------|------------------|---------------------|------------------|
|          19         |         20       |          20         |         21       |
|---------------------|------------------|---------------------|------------------|
|          29         |         30       |          30         |         31       |
|---------------------|------------------|---------------------|------------------|

I would then like to multiply each row of this intermediate dataframe by another list, [2,1,1,3], to create:

|---------------------|------------------|---------------------|------------------|
|         var1        |        var2      |         var3        |       var4       |
|---------------------|------------------|---------------------|------------------|
|          18         |         10       |          10         |         33       |
|---------------------|------------------|---------------------|------------------|
|          38         |         20       |          20         |         63       |
|---------------------|------------------|---------------------|------------------|
|          58         |         30       |          30         |         93       |
|---------------------|------------------|---------------------|------------------|

I am new to PySpark and could not figure out a way to accomplish this. Any suggestions would be appreciated!

The easiest way would be:

import pyspark.sql.functions as F

# Pair each column with its subtraction and multiplication constants,
# then overwrite the column with the transformed expression.
for column, sub, mul in zip(df.columns, [1, 1, 2, 2], [2, 1, 1, 3]):
    df = df.withColumn(column, (F.col(column) - sub) * mul)

df.show()

"""
+----+----+----+----+
|var1|var2|var3|var4|
+----+----+----+----+
|  18|  10|  10|  33|
|  38|  20|  20|  63|
|  58|  30|  30|  93|
+----+----+----+----+
"""

You can also split the two operations and save the intermediate results to separate dataframe variables, as sketched below.
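For example, here is a minimal sketch of that split (the variable names df_sub and df_mul, and the use of select with a list comprehension, are just illustrative; the loop above works equally well):

import pyspark.sql.functions as F

sub_vals = [1, 1, 2, 2]
mul_vals = [2, 1, 1, 3]

# Step 1: subtract the matching value from each column.
df_sub = df.select(*[(F.col(c) - s).alias(c) for c, s in zip(df.columns, sub_vals)])

# Step 2: multiply each column of the intermediate dataframe.
df_mul = df_sub.select(*[(F.col(c) * m).alias(c) for c, m in zip(df_sub.columns, mul_vals)])

df_sub.show()  # intermediate result after the subtraction
df_mul.show()  # final result after the multiplication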
