
Find delta between every 2 columns in Pyspark

I am trying to find the difference between every two columns in a PySpark dataframe with 100+ columns. If there were fewer columns, I could manually create a new column for each pair with df.withColumn('delta', df.col2 - df.col1), but I am trying to do this in a more concise way. Any ideas?

col1 col2 col3 col4
1    5    3    9

Wanted output:

delta1 delta2
4      6
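
For context, the manual approach would look like the sketch below. This assumes an active SparkSession named spark and the column names col1..col4 from the example; it is fine for a handful of columns but does not scale to 100+.

from pyspark.sql import functions as F

# Minimal sketch of the manual per-pair approach (column names taken
# from the example above; 'spark' is an assumed SparkSession).
df = spark.createDataFrame([(1, 5, 3, 9)], ['col1', 'col2', 'col3', 'col4'])

df = df.withColumn('delta1', F.col('col2') - F.col('col1')) \
       .withColumn('delta2', F.col('col4') - F.col('col3'))
df.select('delta1', 'delta2').show()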

All you have to do is create a proper for loop that reads through the list of columns and does the subtraction for you.

Sample data

df = spark.createDataFrame([
    (1, 4, 7, 8),
    (0, 5, 3, 9),
], ['c1', 'c2', 'c3', 'c4'])

+---+---+---+---+
|c1 |c2 |c3 |c4 |
+---+---+---+---+
|1  |4  |7  |8  |
|0  |5  |3  |9  |
+---+---+---+---+

Loop through columns

from pyspark.sql import functions as F

# Pair the columns up: for every even index i, subtract column i
# from column i + 1 and alias the result as delta{i}.
cols = []
for i in range(len(df.columns)):
    if i % 2 == 0:
        cols.append((F.col(df.columns[i + 1]) - F.col(df.columns[i])).alias(f'delta{i}'))

df.select(cols).show()

+------+------+
|delta0|delta2|
+------+------+
|     3|     1|
|     5|     6|
+------+------+
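
If you would rather have the output columns numbered sequentially, like the delta1, delta2 in the question, one possible variant (a sketch against the same df) steps through the columns two at a time with a list comprehension:

from pyspark.sql import functions as F

# One possible variant: iterate over even indices directly and number
# the deltas sequentially, producing delta1, delta2, ...
cols = [
    (F.col(df.columns[i + 1]) - F.col(df.columns[i])).alias(f'delta{i // 2 + 1}')
    for i in range(0, len(df.columns), 2)
]
df.select(cols).show()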
