
UDF to parse JSON strings in a PySpark dataframe

I have a PySpark dataframe that contains JSON strings. It looks like this:

+---------------------------------------------------------------+
|col                                                            |
+---------------------------------------------------------------+
|{"fields":{"list1":[{"list2":[{"list3":[{"type":false}]}]}]}}  |
+---------------------------------------------------------------+

I wrote UDFs to parse the JSON, count the entries whose value matches phone, and return the count in a new column of the dataframe:

def item_count(json, type):
    count = 0
    for i in json.get("fields", {}).get("list1", []):
        for j in i.get("list2", []):
            for k in j.get("list3", []):
                count += k.get("type", None) == type
    return count

def item_phone_count(json):
    return item_count(json,False)

df2 = df.withColumn(
    'item_phone_count',
    F.udf(lambda j: item_phone_count(json.loads(j)), t.StringType())('col')
)

But I got the error:

AttributeError: 'NoneType' object has no attribute 'get'

Any idea what's wrong?
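The error means one of the nested values came through as JSON `null`, which `json.loads` turns into `None`, and the code then calls `.get` on it. A minimal plain-Python repro (the `null` inside `list2` is an assumed example, not necessarily where it occurs in your data):

```python
import json

def item_count(parsed, target):
    # same logic as the question's UDF
    count = 0
    for i in parsed.get("fields", {}).get("list1", []):
        for j in i.get("list2", []):
            for k in j.get("list3", []):
                count += k.get("type", None) == target
    return count

# A row where "list2" contains a JSON null instead of an object
payload = json.loads('{"fields":{"list1":[{"list2":[null]}]}}')
try:
    item_count(payload, False)
except AttributeError as e:
    print(e)  # 'NoneType' object has no attribute 'get'
```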

Check for `None` and skip those entries:

def item_count(json, type):
    count = 0
    if (json is None) or (json.get("fields", {}) is None):
        return count

    for i in json.get("fields", {}).get("list1", []):
        if i is None:
            continue
        for j in i.get("list2", []):
            if j is None:
                continue
            for k in j.get("list3", []):
                if k is None:
                    continue
                count += k.get("type", None) == type
    return count
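With the guards in place, the UDF wiring from the question also deserves two small fixes: the column arrives as a raw string, so it is safer to parse inside the UDF and guard against null rows as well; and since the result is an integer count, `IntegerType` fits better than `StringType`. A sketch (the Spark lines are commented out; `df` and `col` are the question's names, and the extra `or []` guards against a `null` list are an addition beyond the answer above):

```python
import json

def item_count(parsed, target):
    # None-safe nested count, as in the answer above
    count = 0
    if parsed is None or parsed.get("fields", {}) is None:
        return count
    for i in parsed.get("fields", {}).get("list1", []) or []:
        if i is None:
            continue
        for j in i.get("list2", []) or []:
            if j is None:
                continue
            for k in j.get("list3", []) or []:
                if k is None:
                    continue
                count += k.get("type", None) == target
    return count

def item_phone_count(raw):
    # The UDF receives the raw string column: guard null rows,
    # parse here, then count entries whose "type" is false.
    if raw is None:
        return 0
    return item_count(json.loads(raw), False)

# Spark wiring (assumes the question's df with string column "col"):
# from pyspark.sql import functions as F, types as t
# df2 = df.withColumn(
#     "item_phone_count",
#     F.udf(item_phone_count, t.IntegerType())("col"),
# )

print(item_phone_count('{"fields":{"list1":[{"list2":[{"list3":[{"type":false}]}]}]}}'))  # 1
```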
