
How to groupby a data frame in pyspark by a column and get a dictionary with that column as key and list of records as its value?

I have a dataframe like this -

-RECORD 0-------------------------------------------

 id                          | 11           
 order_number                | 254                  
 order_date                  | 2021-03-09           
 store_id                    | abc6            
 employee_code               | 6921_abc40    
 customer_name               | harvey 
 contact_number              | 353          
 address                     | foo 
 locality                    | foo               
 postal_code                 | 5600082332             
 order_info                  | info
 amount                      | 478.8                
 payment_type                | null                 
 timeA                       | 2021-03-10 01:34:26
 timeB                       | 2021-03-10 01:35:26  
             
-RECORD 1-------------------------------------------

 id                          | 12            
 order_number                | 2272                 
 order_date                  | 2021-03-09           
 store_id                    | abc666             
 employee_code               | 66_abc55               
 customer_name               | mike        
 contact_number              | 98          
 address                     | bar
 locality                    | bar
 postal_code                 | 11000734332              
 order_info                  | info
 amount_to_be_collected      | 0.34                 
 payment_type                | null                 
 timeA                       | 2021-03-10 00:18:04  
 timeB                       | 2021-03-10 03:21:06  
 
  

I want to do the following -

Group the records by employee_code and get a dictionary in return, which would look something like this -

{"emp_code": [Record0, Record1, ....]}

i.e., the employee code as the key and a list of all records of that employee as the value.

I am writing a Glue job for this. I could do this programmatically by looping through all the records and building the desired dictionary, but that would take a lot of time. I want to know if there is a way to achieve this result using higher-order PySpark functions?

Using maps

You can create a map column whose key is the employee_code and whose value is a struct (or array):

from pyspark.sql.functions import create_map, struct, col

df = df.select(create_map(col("employee_code"), struct("order_number", "order_date", ...)).alias("complex_map"))

Then you can query the map with selectExpr, looking values up by key (i.e. by an actual employee code):

df.selectExpr("complex_map['6921_abc40']").show(2)

Alternative with structs:

For this you'll need to transform the columns into complex types before grouping, which basically changes the schema from something like:

DataFrame[order_number: string, employee_code: string, ...]

into something like this:

DataFrame[employee_code: string, complex: struct<order_number:string,contact_number:int>]

That can be done using the struct function from pyspark.sql.functions:

from pyspark.sql.functions import col, struct

df.select(col("employee_code"), struct("order_number", "order_date", ...).alias("orders"))

Once you have them in that kind of struct, you can perform a groupBy and use the aggregation function collect_list:

from pyspark.sql.functions import col, struct, collect_list

df.select(col("employee_code"), struct("order_number", "order_date", ...).alias("orders")).groupBy("employee_code").agg(collect_list("orders").alias("orders"))

Then you can select individual columns within the struct:

df.select(col("orders.order_number"))

Or even filter by them:

df.where(col("orders.order_number") > 100).select(col("employee_code"))

If you want to get back to the original one-row-per-record shape, take a look at the explode function, which takes a column of arrays and creates one row per element (with the rest of the values duplicated).
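For example, a short sketch using the grouped dataframe from the collect_list sketch above:

from pyspark.sql.functions import col, explode

# One row per element of the orders array, then flatten the struct fields.
flat = (
    grouped.select("employee_code", explode("orders").alias("order"))
           .select("employee_code", col("order.order_number"), col("order.order_date"))
)
flat.show()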
