Pyspark - get attribute names from json file

Question

I am new to PySpark. My requirement is to extract the attribute names from a nested JSON file. I tried json_normalize from the pandas package. It works for direct attributes but never fetches the attributes inside JSON array attributes. My JSON doesn't have a static structure; it varies for each document that we receive. Could someone please help with an explanation, using the small example below?

        {  
               "id":"1",
               "name":"a",
               "salaries":[  
                  {  
                     "salary":"1000"
                  },
                  {  
                     "salary":"5000"
                  }
               ],
               "states":{  
                  "state":"Karnataka",
                  "cities":[  
                     {  
                        "city":"Bangalore"
                     },
                     {  
                        "city":"Mysore"
                     }
                  ],
                  "state":"Tamil Nadu",
                  "cities":[  
                     {  
                        "city":"Chennai"
                     },
                     {  
                        "city":"Coimbatore"
                     }
                  ]
               }
            }

I am especially interested in the JSON array elements.

Expected output: id name salaries.salary states.state states.cities.city

Answer 1

Here is another solution for extracting all nested attributes from JSON:

import json

result_set = set()


def parse_json_array(json_obj, parent_path):
    # Walk each element of a JSON array; only objects (dicts) carry attribute names.
    for json_ob in json_obj:
        if isinstance(json_ob, dict):
            parse_json(json_ob, parent_path)
    return None


def parse_json(json_obj, parent_path):
    for key in json_obj.keys():
        key_value = json_obj.get(key)
        child_path = str(key) if parent_path == "" else parent_path + "." + str(key)
        if isinstance(key_value, dict):
            parse_json(key_value, child_path)
        elif isinstance(key_value, list):
            parse_json_array(key_value, child_path)
        # Top-level keys are recorded with a leading "." because parent_path is "".
        result_set.add((parent_path + "." + key).encode('ascii', 'ignore'))
    return None


file_name = "C:/input/sample.json"
with open(file_name, "r") as file_data:
    json_data = json.load(file_data)
print(json_data)

parse_json(json_data, "")
print(list(result_set))

Output:

{u'states': {u'state': u'Tamil Nadu', u'cities': [{u'city': u'Chennai'}, {u'city': u'Coimbatore'}]}, u'id': u'1', u'salaries': [{u'salary': u'1000'}, {u'salary': u'5000'}], u'name': u'a'}
['states.cities.city', 'states.cities', '.id', 'states.state', 'salaries.salary', '.salaries', '.states', '.name']
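The printed dictionary above contains only "Tamil Nadu" and its cities. The sample JSON repeats the "state" and "cities" keys inside "states", and Python's json module keeps only the last value for a repeated key. The attribute names are unaffected, but if the silently dropped data matters, a sketch like the following (Python 3; the helper name reject_duplicates is hypothetical) uses the object_pairs_hook parameter of json.loads to flag repeated keys:

```python
import json


def reject_duplicates(pairs):
    """Hypothetical helper: build a dict, raising if a key repeats in one object."""
    result = {}
    for key, value in pairs:
        if key in result:
            raise ValueError("duplicate key: %r" % key)
        result[key] = value
    return result


text = '{"state": "Karnataka", "state": "Tamil Nadu"}'

# Default behaviour: the last value for a repeated key wins.
print(json.loads(text))

# With the hook, the repeated key is reported instead of silently dropped.
try:
    json.loads(text, object_pairs_hook=reject_duplicates)
except ValueError as err:
    print(err)
```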

Note:

My Python version: 2.7
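For readers on Python 3, the same idea can be sketched without the global set. This is a minimal sketch, not the answer's original code: the function name collect_paths and the inline sample are assumptions made for illustration, and top-level keys are recorded without the leading dot.

```python
import json


def collect_paths(node, prefix=""):
    """Return the set of dotted key paths found in a decoded JSON value."""
    paths = set()
    if isinstance(node, dict):
        for key, value in node.items():
            path = key if prefix == "" else prefix + "." + key
            paths.add(path)
            paths |= collect_paths(value, path)
    elif isinstance(node, list):
        # List items share the path of the key that holds the list.
        for item in node:
            paths |= collect_paths(item, prefix)
    return paths


sample = json.loads("""
{"id": "1", "name": "a",
 "salaries": [{"salary": "1000"}, {"salary": "5000"}],
 "states": {"state": "Karnataka", "cities": [{"city": "Bangalore"}]}}
""")

print(sorted(collect_paths(sample)))
```

Like the code above, this also records intermediate paths such as salaries and states.cities, not only the leaves.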

Answer 2

You can also do it this way.

data = { "id":"1", "name":"a", "salaries":[ { "salary":"1000" }, { "salary":"5000" } ], "states":{ "state":"Karnataka", "cities":[ { "city":"Bangalore" }, { "city":"Mysore" } ], "state":"Tamil Nadu", "cities":[ { "city":"Chennai" }, { "city":"Coimbatore" } ] } }



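def dict_ittr(lin, data):
    # Print the dotted path of every leaf attribute; lin is the path built so far.
    for k, v in data.items():
        if isinstance(v, list):
            for l in v:
                if isinstance(l, dict):
                    dict_ittr(lin + "." + k, l)
        elif isinstance(v, dict):
            dict_ittr(lin + "." + k, v)
        else:
            print(lin + "." + k)

dict_ittr("", data)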

Output:

.states.state
.states.cities.city
.states.cities.city
.id
.salaries.salary
.salaries.salary
.name
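The output above repeats a path once per array element and prefixes every path with a dot. A possible variation (Python 3.7+; the name leaf_paths is an assumption, not part of the answer) yields the paths instead of printing them, so repeats can be dropped while keeping first-seen order:

```python
def leaf_paths(node, prefix=""):
    """Yield the dotted path of every scalar value, in document order."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from leaf_paths(value, prefix + "." + key if prefix else key)
    elif isinstance(node, list):
        # Each element is visited under the path of the key holding the list.
        for item in node:
            yield from leaf_paths(item, prefix)
    else:
        yield prefix


data = {
    "id": "1",
    "name": "a",
    "salaries": [{"salary": "1000"}, {"salary": "5000"}],
    "states": {
        "state": "Tamil Nadu",
        "cities": [{"city": "Chennai"}, {"city": "Coimbatore"}],
    },
}

# dict.fromkeys drops repeats while keeping first-seen order.
unique = list(dict.fromkeys(leaf_paths(data)))
print(" ".join(unique))
```

With this sample the joined string has the shape asked for in the question. A key whose value is an empty list or empty object produces no path in this sketch.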

Answer 3

If you treat the json like a python dictionary, this should work.

I just wrote a simple recursive program.

Script

import json

def js_r(filename):
    with open(filename) as f_in:
        return(json.load(f_in))

g = js_r("city.json")
answer_d = {}
def base_line(g, answer_d):
    for key in g.keys():
        answer_d[key] = {}
    return answer_d

answer_d = base_line(g, answer_d)
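def recurser_func(g, answer_d):
    for k in g.keys():
        # If the value is a non-empty list of dictionaries, record the first key of its first element
        if isinstance(g[k], list) and g[k] and isinstance(g[k][0], dict):
            answer_d[k] = {list(g[k][0].keys())[0]: {}}

        if isinstance(g[k], dict) and g[k]:  # If the value is a non-empty dictionary
            # Seed with the first child key, then let the recursive call fill in the rest
            answer_d[k] = {list(g[k].keys())[0]: {}}
            answer_d[k] = recurser_func(g[k], answer_d[k])
    return answer_d
recurser_func(g, answer_d)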


def printer_func(answer_d, list_to_print, parent):
    for k in answer_d.keys():
        if answer_d[k]:
            # The key has children: recurse, extending the full dotted prefix
            printer_func(answer_d[k], list_to_print, parent + k + ".")
        else:
            # Leaf key: record its full dotted path
            list_to_print.append(parent + k)
    return list_to_print



l = printer_func(answer_d, [], "")
final = " ".join(l)
print(final)

Explanation

base_line makes a dictionary of all your base keys.

recurser_func checks whether each key's value is a list or a dict, then adds to the answer dictionary as necessary until answer_d looks like: {'id': {}, 'name': {}, 'salaries': {'salary': {}}, 'states': {'state': {}, 'cities': {'city': {}}}}

After these two functions are called, you have a dictionary of keys, in a sense. Then printer_func is a recursive function that prints it as you desired.

NOTE:

Your question is similar to this one: Get all keys of a nested dictionary. Since you have a nested list/dictionary instead of just a nested dictionary, the answers there won't work for you, but that question has more discussion on the topic if you would like more info.

EDIT 1

My Python version is 3.7.1.

I have added a JSON file opener to the top. I assume that the JSON file is named city.json and is in the same directory.

EDIT 2: More thorough explanation

The main difficulty I found in dealing with your data is that lists and dictionaries can be nested to arbitrary depth. Since the nesting depth is unbounded, I knew this was a recursion problem.

So, I build a dictionary of dictionaries representing the key structure that you are looking for. First, I start with the baseline.

base_line makes {'id': {}, 'name': {}, 'salaries': {}, 'states': {}}. This is a dictionary of empty dictionaries. I know that when you print, every key structure (like states.state) starts with one of these words.

recursion

Then I add all the child keys using recurser_func. When given a dictionary g, this function loops through all the keys in that dictionary and (assuming answer_d has each key that g has) adds each key's child to answer_d.

If the child is a dictionary, I recurse, with g now being the sub-part of the dictionary that pertains to the child, and answer_d being the sub-part of answer_d that pertains to the child.
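The recursion described above can be condensed into a sketch like the one below. It is an illustration under assumptions rather than the script from this answer: the names key_tree and flatten are hypothetical, and unlike the script it visits every element of a list instead of only the first.

```python
def key_tree(node, tree=None):
    """Merge the keys found under node into a nested dict of dicts."""
    if tree is None:
        tree = {}
    if isinstance(node, dict):
        for key, value in node.items():
            # setdefault reuses the subtree when the key was seen before.
            key_tree(value, tree.setdefault(key, {}))
    elif isinstance(node, list):
        # Every element contributes its keys to the same subtree.
        for item in node:
            key_tree(item, tree)
    return tree


def flatten(tree, parent=""):
    """Return the dotted paths of the leaves of a key tree."""
    paths = []
    for key, children in tree.items():
        if children:
            paths.extend(flatten(children, parent + key + "."))
        else:
            paths.append(parent + key)
    return paths


g = {
    "id": "1",
    "name": "a",
    "salaries": [{"salary": "1000"}, {"salary": "5000"}],
    "states": {
        "state": "Tamil Nadu",
        "cities": [{"city": "Chennai"}, {"city": "Coimbatore"}],
    },
}

answer_d = key_tree(g)
print(answer_d)
print(" ".join(flatten(answer_d)))
```

Building the tree first and flattening it afterwards keeps the two concerns separate: the tree merges keys from all documents or list elements, and the flattening step decides how the paths are formatted.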

3 answers

solution1 1 ACCEPTED 2019-01-10 07:13:50

solution2 1 2019-03-13 10:27:09

solution3 0 2019-01-07 23:30:23
