
PySpark: left-padding an integer array column with zeros without a pandas UDF

I need to left-pad an array column of a PySpark DataFrame with zeros, without using a pandas UDF, so that every array is as long as the longest one.

Input DataFrame:
|lags|
|----|
|[0]|
|[0,1,2]|
|[0,1]|

Output DataFrame:
|lags|
|----|
|[0,0,0]|
|[0,1,2]|
|[0,0,1]|

You can use array_repeat to build an array of padding zeros and concat to prepend it to the original array.

Use @ARCrow's function to find the maximum array size; it is hard-coded as 3 here.

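from pyspark.sql import functions as F

max_arr_size = 3  # length of the longest array in 'lags'

# 'pad' holds the missing zeros; 'padded' is 'pad' followed by 'lags'
df = (df.withColumn('pad', F.array_repeat(F.lit(0), max_arr_size - F.size('lags')))
      .withColumn('padded', F.concat('pad', 'lags')))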

This is how I did it:

import pyspark.sql.functions as f

df = spark.createDataFrame([
    ([0],),
    ([0,1,2],),
    ([0,1],),
    (None,)
], ['lags'])

# Length of the longest array in 'lags'
max_size = (df
            .withColumn('array_size', f.size(f.col('lags')))
            .groupBy()
            .agg(f.max(f.col('array_size')).alias('max_size'))
            .collect()[0].max_size
           )

df = (df
      # replace null arrays with empty arrays
      .withColumn('lags', f.when(f.col('lags').isNull(), f.array(*[])).otherwise(f.col('lags')))
      # sequence 0..(max_size - size): one element more than the number of zeros needed
      .withColumn('pre_zeros', f.sequence(f.lit(0), f.lit(max_size) - f.size(f.col('lags'))))
      # drop the extra element and map every remaining element to 0
      .withColumn('zeros', f.expr('transform(slice(pre_zeros, 1, size(pre_zeros) - 1), element -> 0)'))
      # prepend the zeros to the original array
      .withColumn('final_lags', f.concat(f.col('zeros'), f.col('lags')))
     )

df.show()

And the output is:

+---------+------------+---------+----------+
|     lags|   pre_zeros|    zeros|final_lags|
+---------+------------+---------+----------+
|      [0]|   [0, 1, 2]|   [0, 0]| [0, 0, 0]|
|[0, 1, 2]|         [0]|       []| [0, 1, 2]|
|   [0, 1]|      [0, 1]|      [0]| [0, 0, 1]|
|       []|[0, 1, 2, 3]|[0, 0, 0]| [0, 0, 0]|
+---------+------------+---------+----------+
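
As a quick way to check the final_lags column without Spark, the same padding rule can be sketched for plain Python lists. The helper name left_pad is made up for this example, and treating None as an empty list mirrors the null handling above.

```python
def left_pad(arrays, fill=0):
    """Left-pad every list to the length of the longest one; None counts as empty."""
    cleaned = [a if a is not None else [] for a in arrays]
    width = max((len(a) for a in cleaned), default=0)
    # Prepend as many fill values as each list is short of the maximum width.
    return [[fill] * (width - len(a)) + a for a in cleaned]


expected = left_pad([[0], [0, 1, 2], [0, 1], None])
print(expected)  # [[0, 0, 0], [0, 1, 2], [0, 0, 1], [0, 0, 0]]
```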
