
Find delta between every 2 columns in Pyspark

I am trying to find the difference between every two columns in a PySpark dataframe with 100+ columns. If there were fewer columns, I could manually create a new column for each pair with df.withColumn('delta', df.col2 - df.col1), but I am trying to do this in a more concise way. Any ideas?

col1 col2 col3 col4
1    5    3    9

Wanted output:

delta1 delta2
4      6
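
For context, the manual approach would look like the sketch below. This assumes an active SparkSession named spark and the column names col1..col4 from the example; it is fine for a handful of columns but does not scale to 100+.

from pyspark.sql import functions as F

# Minimal sketch of the manual per-pair approach (column names taken
# from the example above; 'spark' is an assumed SparkSession).
df = spark.createDataFrame([(1, 5, 3, 9)], ['col1', 'col2', 'col3', 'col4'])

df = df.withColumn('delta1', F.col('col2') - F.col('col1')) \
       .withColumn('delta2', F.col('col4') - F.col('col3'))
df.select('delta1', 'delta2').show()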

All you have to do is create a proper for loop that reads through the list of columns and does the subtraction for you.

Sample data

df = spark.createDataFrame([
    (1, 4, 7, 8),
    (0, 5, 3, 9),
], ['c1', 'c2', 'c3', 'c4'])

+---+---+---+---+
|c1 |c2 |c3 |c4 |
+---+---+---+---+
|1  |4  |7  |8  |
|0  |5  |3  |9  |
+---+---+---+---+

Loop through columns

from pyspark.sql import functions as F

# Pair the columns up: for every even index i, subtract column i
# from column i + 1 and alias the result as delta{i}.
cols = []
for i in range(len(df.columns)):
    if i % 2 == 0:
        cols.append((F.col(df.columns[i + 1]) - F.col(df.columns[i])).alias(f'delta{i}'))

df.select(cols).show()

+------+------+
|delta0|delta2|
+------+------+
|     3|     1|
|     5|     6|
+------+------+
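
If you would rather have the output columns numbered sequentially, like the delta1, delta2 in the question, one possible variant (a sketch against the same df) steps through the columns two at a time with a list comprehension:

from pyspark.sql import functions as F

# One possible variant: iterate over even indices directly and number
# the deltas sequentially, producing delta1, delta2, ...
cols = [
    (F.col(df.columns[i + 1]) - F.col(df.columns[i])).alias(f'delta{i // 2 + 1}')
    for i in range(0, len(df.columns), 2)
]
df.select(cols).show()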
