How to create a dictionary from hierarchical relations in a Spark DataFrame

I have the following Spark DataFrame:

schema = 'EMPLOYEE_NUMBER int, MANAGER_EMPLOYEE_NUMBER int'
employees = spark.createDataFrame(
[[801,None], 
[1016,801], 
[1003,801], 
[1019,801], 
[1010,1003], 
[1004,1003], 
[1001,1003],
[1012,1004], 
[1002,1004], 
[1015,1004], 
[1008,1019], 
[1006,1019], 
[1014,1019],
[1011,1019]], schema=schema)

I want to create a dictionary from the above DataFrame that maps each manager to their direct reports, like {801:[1003,1019,1016], 1019:[1014,1011,1008,1006], 1003:[1010,1001,1004]}. Can I build a dictionary like this from a DataFrame?

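from pyspark.sql.functions import col, collect_set

# Group by manager, collect each manager's distinct direct reports, and bring the rows to the driver
datas = employees.groupBy('MANAGER_EMPLOYEE_NUMBER').agg(collect_set(col('EMPLOYEE_NUMBER')).alias('values')).collect()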

ur_dict = {}
for item in datas:
    ur_dict[item['MANAGER_EMPLOYEE_NUMBER']] = item['values']

print(ur_dict)
# {1019: [1014, 1011, 1008, 1006], None: [801], 1003: [1004, 1001, 1010], 801: [1003, 1019, 1016], 1004: [1002, 1015, 1012]}
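When the table is small enough to collect, the same grouping can also be done on the driver in plain Python. The sketch below is only an illustration under that assumption: the `pairs` list is a hypothetical stand-in for the (EMPLOYEE_NUMBER, MANAGER_EMPLOYEE_NUMBER) rows that `employees.collect()` would return, and `build_reports` is a helper name introduced here, not part of PySpark.

```python
from collections import defaultdict

# Hypothetical stand-in for employees.collect(): (employee, manager) pairs
pairs = [
    (801, None), (1016, 801), (1003, 801), (1019, 801),
    (1010, 1003), (1004, 1003), (1001, 1003),
    (1012, 1004), (1002, 1004), (1015, 1004),
    (1008, 1019), (1006, 1019), (1014, 1019), (1011, 1019),
]

def build_reports(pairs, keep_root=False):
    """Map each manager to the sorted list of their direct reports."""
    reports = defaultdict(set)  # a set drops duplicates, as collect_set does
    for employee, manager in pairs:
        if manager is None and not keep_root:
            continue  # skip the top of the hierarchy, which has no manager
        reports[manager].add(employee)
    # Sorting gives a stable order; collect_set itself guarantees none
    return {manager: sorted(emps) for manager, emps in reports.items()}

reports = build_reports(pairs)
print(reports)
```

Sorting each list is a design choice made here so that repeated runs print the same result; drop it if the order of reports does not matter.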

You can use collect_list to group all the employees under each MANAGER_EMPLOYEE_NUMBER, then use collect and asDict in conjunction with map to transform the result into a dictionary.

Data Preparation

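from pprint import pprint
from pyspark.sql import functions as F

schema = 'EMPLOYEE_NUMBER int, MANAGER_EMPLOYEE_NUMBER int'

# `spark` is the active SparkSession, as in the question
employees = spark.createDataFrame(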
[[801,None], 
[1016,801], 
[1003,801], 
[1019,801], 
[1010,1003], 
[1004,1003], 
[1001,1003],
[1012,1004], 
[1002,1004], 
[1015,1004], 
[1008,1019], 
[1006,1019], 
[1014,1019],
[1011,1019]], schema=schema)

Collect List

employees_agg = employees.groupBy('MANAGER_EMPLOYEE_NUMBER')\
                         .agg(F.collect_list(F.col('EMPLOYEE_NUMBER')).alias('EMPLOYEES'))\
                         .filter(F.col('MANAGER_EMPLOYEE_NUMBER').isNotNull())


employees_agg.show()

+-----------------------+--------------------+
|MANAGER_EMPLOYEE_NUMBER|           EMPLOYEES|
+-----------------------+--------------------+
|                   1019|[1008, 1006, 1014...|
|                   1003|  [1010, 1004, 1001]|
|                    801|  [1016, 1003, 1019]|
|                   1004|  [1012, 1002, 1015]|
+-----------------------+--------------------+

Transform

final_dict = {
        row['MANAGER_EMPLOYEE_NUMBER']: row['EMPLOYEES'] 
            for row in  map(lambda row: row.asDict(), employees_agg.collect())
}


pprint(final_dict)

{
 801: [1016, 1003, 1019],
 1003: [1010, 1004, 1001],
 1004: [1012, 1002, 1015],
 1019: [1008, 1006, 1014, 1011]
}
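As a usage sketch, a dictionary of this shape can be walked to list everyone below a given manager, not only the direct reports. The `all_reports` helper is a hypothetical name introduced here, and the code assumes the hierarchy contains no cycles; `final_dict` is copied from the output above so that the example runs without Spark.

```python
from collections import deque

# Dictionary copied from the output above: manager -> direct reports
final_dict = {
    801: [1016, 1003, 1019],
    1003: [1010, 1004, 1001],
    1004: [1012, 1002, 1015],
    1019: [1008, 1006, 1014, 1011],
}

def all_reports(manager, reports):
    """Return every direct and indirect report of `manager`, breadth-first."""
    found = []
    queue = deque(reports.get(manager, []))
    while queue:
        employee = queue.popleft()
        found.append(employee)
        # Employees who manage nobody are simply absent from the dictionary
        queue.extend(reports.get(employee, []))
    return found

print(all_reports(1003, final_dict))
# [1010, 1004, 1001, 1012, 1002, 1015]
```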
