I have a PySpark dataframe which we'll assume looks like this:
df = spark.createDataFrame([(10,11,12,13), (20,21,22,23), (30,31,32,33)],['var1', 'var2', 'var3', 'var4'])
df.show()
>
+----+----+----+----+
|var1|var2|var3|var4|
+----+----+----+----+
|  10|  11|  12|  13|
|  20|  21|  22|  23|
|  30|  31|  32|  33|
+----+----+----+----+
I am trying to subtract a list of values element-wise from each of these rows. Let's say the list is [1,1,2,2]. The expected result of this operation is:
+----+----+----+----+
|var1|var2|var3|var4|
+----+----+----+----+
|   9|  10|  10|  11|
|  19|  20|  20|  21|
|  29|  30|  30|  31|
+----+----+----+----+
I would then like to multiply each row of this intermediate dataframe element-wise with another list [2,1,1,3] to create:
+----+----+----+----+
|var1|var2|var3|var4|
+----+----+----+----+
|  18|  10|  10|  33|
|  38|  20|  20|  63|
|  58|  30|  30|  93|
+----+----+----+----+
I am new to PySpark and could not figure out a way to accomplish this. Any suggestions would be appreciated!
The easiest way would be:
import pyspark.sql.functions as F

# Pair each column with its subtrahend and multiplier, then rebuild it in place.
for column, sub, mul in zip(df.columns, [1, 1, 2, 2], [2, 1, 1, 3]):
    df = df.withColumn(column, (F.col(column) - sub) * mul)

df.show()
"""
+----+----+----+----+
|var1|var2|var3|var4|
+----+----+----+----+
| 18| 10| 10| 33|
| 38| 20| 20| 63|
| 58| 30| 30| 93|
+----+----+----+----+
"""
You can also split the operations and save the intermediate dataframes to separate variables, as in the sketch below.
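For example, here is a minimal sketch of that split (the names df_subtracted and df_multiplied are just illustrative), using select() so each step produces a new dataframe instead of overwriting df in a loop:

import pyspark.sql.functions as F

sub_list = [1, 1, 2, 2]  # values to subtract, per column
mul_list = [2, 1, 1, 3]  # values to multiply, per column

# Step 1: subtract element-wise; select() rebuilds every column in one pass.
df_subtracted = df.select(
    *[(F.col(c) - s).alias(c) for c, s in zip(df.columns, sub_list)]
)

# Step 2: multiply the intermediate dataframe element-wise by the second list.
df_multiplied = df_subtracted.select(
    *[(F.col(c) * m).alias(c) for c, m in zip(df_subtracted.columns, mul_list)]
)

df_subtracted.show()  # intermediate result (9, 10, 10, 11 / ...)
df_multiplied.show()  # final result (18, 10, 10, 33 / ...)

This keeps each transformation inspectable on its own, which can be handy while debugging.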