I am trying to find the difference between every two columns in a PySpark dataframe with 100+ columns. If there were fewer, I could manually create a new column for each pair with df.withColumn('delta', df.col1 - df.col2), but I am trying to do this in a more concise way. Any ideas?
col1 | col2 | col3 | col4 |
---|---|---|---|
1 | 5 | 3 | 9 |
Wanted output:
delta1 | delta2 |
---|---|
4 | 6 |
All you have to do is write a for loop that walks through the list of columns and does the subtraction for each pair.
Sample data
df = spark.createDataFrame([
    (1, 4, 7, 8),
    (0, 5, 3, 9),
], ['c1', 'c2', 'c3', 'c4'])
+---+---+---+---+
|c1 |c2 |c3 |c4 |
+---+---+---+---+
|1 |4 |7 |8 |
|0 |5 |3 |9 |
+---+---+---+---+
Loop through columns
from pyspark.sql import functions as F

cols = []
for i in range(0, len(df.columns), 2):
    # subtract each even-indexed column from the column that follows it
    cols.append((F.col(df.columns[i + 1]) - F.col(df.columns[i])).alias(f'delta{i}'))

df.select(cols).show()
+------+------+
|delta0|delta2|
+------+------+
| 3| 1|
| 5| 6|
+------+------+
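A more concise variant (a sketch, assuming the same c1..c4 schema as above) pairs the columns with zip and builds SQL expression strings, which you can pass to df.selectExpr. The pairing logic is plain Python, so it is easy to check without a Spark session:

```python
# Pair up columns two at a time: (c1, c2), (c3, c4), ...
cols = ['c1', 'c2', 'c3', 'c4']  # in practice: df.columns
pairs = list(zip(cols[::2], cols[1::2]))

# Build one "right - left AS deltaN" expression per pair,
# numbered from 1 to match the delta1/delta2 naming in the question.
exprs = [f'{b} - {a} AS delta{n}' for n, (a, b) in enumerate(pairs, start=1)]
# exprs == ['c2 - c1 AS delta1', 'c4 - c3 AS delta2']

# Then, on the dataframe:
# df.selectExpr(*exprs).show()
```

Note that with an odd number of columns the trailing column is silently dropped by zip, which may or may not be what you want.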