How to explode multiple columns of a dataframe in pyspark

Question

I have a dataframe which consists lists in columns similar to the following. The length of the lists in all columns is not same.

Name  Age  Subjects                  Grades
[Bob] [16] [Maths,Physics,Chemistry] [A,B,C]

I want to explode the dataframe in such a way that i get the following output-

Name Age Subjects Grades
Bob  16   Maths     A
Bob  16  Physics    B
Bob  16  Chemistry  C

How can I achieve this?

Answer 1

PySpark has added an arrays_zip function in 2.4, which eliminates the need for a Python UDF to zip the arrays.

import pyspark.sql.functions as F
from pyspark.sql.types import *

df = sql.createDataFrame(
    [(['Bob'], [16], ['Maths','Physics','Chemistry'], ['A','B','C'])],
    ['Name','Age','Subjects', 'Grades'])
df = df.withColumn("new", F.arrays_zip("Subjects", "Grades"))\
       .withColumn("new", F.explode("new"))\
       .select("Name", "Age", F.col("new.Subjects").alias("Subjects"), F.col("new.Grades").alias("Grades"))
df.show()

+-----+----+---------+------+
| Name| Age| Subjects|Grades|
+-----+----+---------+------+
|[Bob]|[16]|    Maths|     A|
|[Bob]|[16]|  Physics|     B|
|[Bob]|[16]|Chemistry|     C|
+-----+----+---------+------+

Answer 2

This works,

import pyspark.sql.functions as F
from pyspark.sql.types import *

df = sql.createDataFrame(
    [(['Bob'], [16], ['Maths','Physics','Chemistry'], ['A','B','C'])],
    ['Name','Age','Subjects', 'Grades'])
df.show()

+-----+----+--------------------+---------+
| Name| Age|            Subjects|   Grades|
+-----+----+--------------------+---------+
|[Bob]|[16]|[Maths, Physics, ...|[A, B, C]|
+-----+----+--------------------+---------+

Use udf with zip . Those columns needed to explode have to be merged before exploding.

combine = F.udf(lambda x, y: list(zip(x, y)),
              ArrayType(StructType([StructField("subs", StringType()),
                                    StructField("grades", StringType())])))

df = df.withColumn("new", combine("Subjects", "Grades"))\
       .withColumn("new", F.explode("new"))\
       .select("Name", "Age", F.col("new.subs").alias("Subjects"), F.col("new.grades").alias("Grades"))
df.show()


+-----+----+---------+------+
| Name| Age| Subjects|Grades|
+-----+----+---------+------+
|[Bob]|[16]|    Maths|     A|
|[Bob]|[16]|  Physics|     B|
|[Bob]|[16]|Chemistry|     C|
+-----+----+---------+------+

Answer 3

Arriving late to the party :-)

The simplest way to go is by using inline that doesn't have python API but is supported by selectExpr .

df.selectExpr('Name[0] as Name','Age[0] as Age','inline(arrays_zip(Subjects,Grades))').show()

+----+---+---------+------+
|Name|Age| Subjects|Grades|
+----+---+---------+------+
| Bob| 16|    Maths|     A|
| Bob| 16|  Physics|     B|
| Bob| 16|Chemistry|     C|
+----+---+---------+------+

Answer 4

Have you tried this

df.select(explode(split(col("Subjects"))).alias("Subjects")).show()

you can convert the data frame to an RDD.

For an RDD you can use a flatMap function to separate the Subjects.

Answer 5

Copy/paste function if you need to repeat this quickly and easily across a large number of columns in a dataset

cols = ["word", "stem", "pos", "ner"]

def explode_cols(self, data, cols):
    data = data.withColumn('exp_combo', f.arrays_zip(*cols))
    data = data.withColumn('exp_combo', f.explode('exp_combo'))
    for col in cols:
        data = data.withColumn(col, f.col('exp_combo.' + col))

    return data.drop(f.col('exp_combo'))

result = explode_cols(data, cols)

Your welcome :)

Answer 6

When Exploding multiple columns, the above solution comes in handy only when the length of array is same, but if they are not. It is better to explode them separately and take distinct values each time.

df = sql.createDataFrame(
    [(['Bob'], [16], ['Maths','Physics','Chemistry'], ['A','B','C'])],
    ['Name','Age','Subjects', 'Grades'])

df = df.withColumn('Subjects',F.explode('Subjects')).select('Name','Age','Subjects', 'Grades').distinct()

df = df.withColumn('Grades',F.explode('Grades')).select('Name','Age','Subjects', 'Grades').distinct()

df.show()

 +----+---+---------+------+
|Name|Age| Subjects|Grades|
+----+---+---------+------+
| Bob| 16|    Maths|     A|
| Bob| 16|  Physics|     B|
| Bob| 16|Chemistry|     C|
+----+---+---------+------+

Answer 7

Thanks @nasty for saving the day. Just small tweaks to get the code working.

def explode_cols( df, cl):
df = df.withColumn('exp_combo', arrays_zip(*cl))
df = df.withColumn('exp_combo', explode('exp_combo'))
for colm in cl:
    final_col = 'exp_combo.'+ colm 
    df = df.withColumn(final_col, col(final_col))
    
    #print col
    #print ('exp_combo.'+ colm)
return df.drop(col('exp_combo'))

How to explode multiple columns of a dataframe in pyspark

Question

7 answers

solution1
48 2019-09-03 09:55:07

solution2
17 ACCPTED 2018-06-28 14:14:53

solution3
4 2022-02-02 13:07:16

solution4
1 2018-06-28 12:25:09

solution5
0 2022-04-07 23:04:43

solution6
0 2022-11-19 18:36:20

solution7
0 2023-01-01 10:41:22

How to explode multiple columns of a dataframe in pyspark

Question

7 answers

solution1 48 2019-09-03 09:55:07

solution2 17 ACCPTED 2018-06-28 14:14:53

solution3 4 2022-02-02 13:07:16

solution4 1 2018-06-28 12:25:09

solution5 0 2022-04-07 23:04:43

solution6 0 2022-11-19 18:36:20

solution7 0 2023-01-01 10:41:22

solution1
48 2019-09-03 09:55:07

solution2
17 ACCPTED 2018-06-28 14:14:53

solution3
4 2022-02-02 13:07:16

solution4
1 2018-06-28 12:25:09

solution5
0 2022-04-07 23:04:43

solution6
0 2022-11-19 18:36:20

solution7
0 2023-01-01 10:41:22