
Add a list of values to each row in a PySpark dataframe

I have a PySpark dataframe, which we'll assume looks like this:

df = spark.createDataFrame([(10,11,12,13), (20,21,22,23), (30,31,32,33)],['var1', 'var2', 'var3', 'var4'])
df.show()
>
|---------------------|------------------|---------------------|------------------|
|         var1        |        var2      |         var3        |       var4       |
|---------------------|------------------|---------------------|------------------|
|          10         |         11       |          12         |         13       |
|---------------------|------------------|---------------------|------------------|
|          20         |         21       |          22         |         23       |
|---------------------|------------------|---------------------|------------------|
|          30         |         31       |          32         |         33       |
|---------------------|------------------|---------------------|------------------|

I am trying to subtract a list of values from each of these rows. Let's say the list is [1,1,2,2]. The expected result from this operation is:

|---------------------|------------------|---------------------|------------------|
|         var1        |        var2      |         var3        |       var4       |
|---------------------|------------------|---------------------|------------------|
|          9          |         10       |          10         |         11       |
|---------------------|------------------|---------------------|------------------|
|          19         |         20       |          20         |         21       |
|---------------------|------------------|---------------------|------------------|
|          29         |         30       |          30         |         31       |
|---------------------|------------------|---------------------|------------------|

I would then like to multiply each row of this intermediate dataframe by another list, [2,1,1,3], to create:

|---------------------|------------------|---------------------|------------------|
|         var1        |        var2      |         var3        |       var4       |
|---------------------|------------------|---------------------|------------------|
|          18         |         10       |          10         |         33       |
|---------------------|------------------|---------------------|------------------|
|          38         |         20       |          20         |         63       |
|---------------------|------------------|---------------------|------------------|
|          58         |         30       |          30         |         93       |
|---------------------|------------------|---------------------|------------------|

I am new to PySpark and could not figure out a way to accomplish this. Any suggestions would be appreciated!

The easiest way would be:

import pyspark.sql.functions as F

# Pair each column with its subtraction and multiplication constants,
# then overwrite the column with the transformed expression.
for column, sub, mul in zip(df.columns, [1, 1, 2, 2], [2, 1, 1, 3]):
    df = df.withColumn(column, (F.col(column) - sub) * mul)

df.show()

"""
+----+----+----+----+
|var1|var2|var3|var4|
+----+----+----+----+
|  18|  10|  10|  33|
|  38|  20|  20|  63|
|  58|  30|  30|  93|
+----+----+----+----+
"""

You can also split the two operations and save the intermediate results to separate dataframe variables, as sketched below.
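For example, here is a minimal sketch of that split (the variable names df_sub and df_mul, and the use of select with a list comprehension, are just illustrative; the loop above works equally well):

import pyspark.sql.functions as F

sub_vals = [1, 1, 2, 2]
mul_vals = [2, 1, 1, 3]

# Step 1: subtract the matching value from each column.
df_sub = df.select(*[(F.col(c) - s).alias(c) for c, s in zip(df.columns, sub_vals)])

# Step 2: multiply each column of the intermediate dataframe.
df_mul = df_sub.select(*[(F.col(c) * m).alias(c) for c, m in zip(df_sub.columns, mul_vals)])

df_sub.show()  # intermediate result after the subtraction
df_mul.show()  # final result after the multiplication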
