
How to write a python function as udf which returns a dictionary type

I am working with pyspark. I have a Spark data frame in the following format:

| person_id | person_attributes
|-----------|----------------------------------------------------------------------
| id_1      | "department=Sales__title=Sales_executive__level=junior"
| id_2      | "department=Engineering__title=Software Engineer__level=entry-level"

I have written a python function which takes the person_id and person_attributes and returns JSON in the following format: {"id_1":{"properties":[{"department":'Sales'},{"title":'Sales_executive'},{}]}}

But I don't know how to register this as a udf in pyspark with the proper output type. Here is the python code:

def create_json_from_string(pid, attribute_string):
    results = []
    attribute_map = {}

    # Split the attribute_string into key/value pairs and store them in attribute_map
    if attribute_string != '':
        for substring in attribute_string.split("__"):
            k, v = substring.split("=")
            attribute_map[str(k)] = str(v)

    for k, v in attribute_map.items():
        results.append({k: v})

    output = {pid: {"properties": results}}
    return output
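For a quick sanity check, the same parsing logic can be exercised in plain Python, with no Spark involved. This condensed sketch uses the sample row from the table above:

```python
def create_json_from_string(pid, attribute_string):
    # Split "k1=v1__k2=v2" into [{"k1": "v1"}, {"k2": "v2"}]
    results = []
    if attribute_string:
        for substring in attribute_string.split("__"):
            k, v = substring.split("=", 1)
            results.append({k: v})
    return {pid: {"properties": results}}

out = create_json_from_string(
    "id_1", "department=Sales__title=Sales_executive__level=junior")
# out == {'id_1': {'properties': [{'department': 'Sales'},
#                                 {'title': 'Sales_executive'},
#                                 {'level': 'junior'}]}}
```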

You need to modify your function to return just the map for a string, rather than building the full structure. After that, the function can be applied to an individual column instead of the whole row. Something like this:

from pyspark.sql.types import MapType, StringType
from pyspark.sql.functions import col

def struct_from_string(attribute_string):
    attribute_map = {}
    if attribute_string != '':
        # Split "k1=v1__k2=v2" into key/value pairs
        for substring in attribute_string.split("__"):
            k, v = substring.split("=")
            attribute_map[str(k)] = str(v)
    return attribute_map

my_parse_string_udf = spark.udf.register("my_parse_string", struct_from_string,
    MapType(StringType(), StringType()))

and then it can be used as follows:

df2 = df.select(col("person_id"), my_parse_string_udf(col("person_attributes")))

In Spark, UDFs are treated as a black box; if you want a DataFrame-API-based solution:

Spark 2.4+

Create the DataFrame:

df = spark.createDataFrame([
    ('id_1', "department=Sales__title=Sales_executive__level=junior"),
    ('id_2', "department=Engineering__title=Software Engineer__level=entry-level"),
], ['person_id', 'person_attributes'])

df.show()
+---------+--------------------+
|person_id|   person_attributes|
+---------+--------------------+
|     id_1|department=Sales_...|
|     id_2|department=Engine...|
+---------+--------------------+

Convert person_attributes into a map:

import pyspark.sql.functions as f

df2 = df.select('person_id', f.map_from_arrays(
    f.expr("transform(transform(split(person_attributes,'__'),x->split(x,'=')),y->y[0])"),
    f.expr("transform(transform(split(person_attributes,'__'),x->split(x,'=')),y->y[1])")).alias('value'))

df2.show(2,False)

+---------+-----------------------------------------------------------------------------+
|person_id|value                                                                        |
+---------+-----------------------------------------------------------------------------+
|id_1     |[department -> Sales, title -> Sales_executive, level -> junior]             |
|id_2     |[department -> Engineering, title -> Software Engineer, level -> entry-level]|
+---------+-----------------------------------------------------------------------------+

Create the required structure:

df2.select(f.create_map('person_id',f.create_map(f.lit('properties'),'value')).alias('json')).toJSON().collect()

['{"json":{"id_1":{"properties":{"department":"Sales","title":"Sales_executive","level":"junior"}}}}',
 '{"json":{"id_2":{"properties":{"department":"Engineering","title":"Software Engineer","level":"entry-level"}}}}']

You can collect the result or use the dataframe directly; if you collect it, parse the JSON strings like this:

import json

# `data` is the list of JSON strings returned by toJSON().collect() above
for i in data:
    d = json.loads(i)
    print(d['json'])

{'id_1': {'properties': {'department': 'Sales', 'title': 'Sales_executive', 'level': 'junior'}}}
{'id_2': {'properties': {'department': 'Engineering', 'title': 'Software Engineer', 'level': 'entry-level'}}}
