I have a dataframe like this -
-RECORD 0-------------------------------------------
id | 11
order_number | 254
order_date | 2021-03-09
store_id | abc6
employee_code | 6921_abc40
customer_name | harvey
contact_number | 353
address | foo
locality | foo
postal_code | 5600082332
order_info | info
amount | 478.8
payment_type | null
timeA | 2021-03-10 01:34:26
timeB | 2021-03-10 01:35:26
-RECORD 1-------------------------------------------
id | 12
order_number | 2272
order_date | 2021-03-09
store_id | abc666
employee_code | 66_abc55
customer_name | mike
contact_number | 98
address | bar
locality | bar
postal_code | 11000734332
order_info | info
amount | 0.34
payment_type | null
timeA | 2021-03-10 00:18:04
timeB | 2021-03-10 03:21:06
I want to do the following -
Group the records by employee_code and get back a dictionary that looks something like this -
{"emp_code": [Record0, Record1, ....]}
i.e., the employee code as the key and a list of all of that employee's records as the value.
I am writing an AWS Glue job for this. I could do this programmatically by looping through all the records and building the desired dictionary, but that would take a lot of time. Is there a way to achieve this result using higher-order PySpark functions?
You can create a map keyed by employee_code, with a struct (or an array) as the value. In the Python API the function is create_map:
from pyspark.sql.functions import col, create_map, struct

# add the remaining record columns inside struct() as needed
df = df.select(create_map(col("employee_code"), struct("order_number", "order_date")).alias("complex_map"))
Then you can query it using selectExpr, indexing the map with an actual employee code (the literal string 'employee_code' is not a key):
df.selectExpr("complex_map['6921_abc40']").show(2)
For this you'll need to do some transformations into complex types before grouping, which basically turns a structure like:
DataFrame[order_number: string, employee_code: string, ....]
into something like this:
DataFrame[employee_code: string, orders: struct<order_number:string,contact_number:int>]
That can be done with the struct function from pyspark.sql.functions:
from pyspark.sql.functions import col, struct

# add the remaining record columns inside struct() as needed
df.select(col("employee_code"), struct("order_number", "order_date").alias("orders"))
Once you have them in that kind of struct you can group by employee_code and use the aggregation function collect_list:
from pyspark.sql.functions import col, collect_list, struct

df.select(col("employee_code"), struct("order_number", "order_date").alias("orders")) \
    .groupBy("employee_code") \
    .agg(collect_list("orders").alias("orders"))
Then you can select individual columns within the struct as:
df.select(col("orders.order_number"))
(After the collect_list aggregation, orders is an array of structs, so the same expression returns an array of order numbers per employee.)
Or even filter by them, keeping the filter before the select so the orders column is still in scope:
df.where(col("orders.order_number") > 100).select(col("employee_code"))
If you want to get back to the original flat layout, take a look at the explode function, which takes a column of arrays and produces one row per array element (with the remaining column values duplicated).
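For illustration, a minimal sketch continuing the hypothetical grouped frame above:
from pyspark.sql.functions import col, explode

# One output row per element of the orders array; employee_code is duplicated across them.
flat = grouped.select(col("employee_code"), explode(col("orders")).alias("order"))
flat.select("employee_code", "order.order_number").show()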