Pyspark - get attribute names from json file
I am new to PySpark. My requirement is to extract the attribute names from a nested JSON file.
I tried using json_normalize imported from the pandas package. It works for direct attributes but never fetches the attributes inside JSON array attributes.
My JSON doesn't have a static structure. It varies for each document that we receive.
Could someone please help me with an explanation for the small example provided below?
{
  "id":"1",
  "name":"a",
  "salaries":[
    {
      "salary":"1000"
    },
    {
      "salary":"5000"
    }
  ],
  "states":{
    "state":"Karnataka",
    "cities":[
      {
        "city":"Bangalore"
      },
      {
        "city":"Mysore"
      }
    ],
    "state":"Tamil Nadu",
    "cities":[
      {
        "city":"Chennai"
      },
      {
        "city":"Coimbatore"
      }
    ]
  }
}
Especially for the JSON array elements.
Expected output: id name salaries.salary states.state states.cities.city
Here is another solution for extracting all nested attributes from JSON:
import json

result_set = set()

def parse_json_array(json_array, parent_path):
    # Recurse into every element of a JSON array that is itself an object.
    for item in json_array:
        if isinstance(item, dict):
            parse_json(item, parent_path)

def parse_json(json_obj, parent_path):
    for key in json_obj.keys():
        key_value = json_obj.get(key)
        child_path = str(key) if parent_path == "" else parent_path + "." + str(key)
        if isinstance(key_value, dict):
            parse_json(key_value, child_path)
        elif isinstance(key_value, list):
            parse_json_array(key_value, child_path)
        # Every key is recorded, containers included.
        # Top-level keys end up with a leading "." because parent_path is "".
        result_set.add((parent_path + "." + key).encode('ascii', 'ignore'))

file_name = "C:/input/sample.json"
with open(file_name, "r") as file_data:
    json_data = json.load(file_data)
print(json_data)
parse_json(json_data, "")
print(list(result_set))
Output:
{u'states': {u'state': u'Tamil Nadu', u'cities': [{u'city': u'Chennai'}, {u'city': u'Coimbatore'}]}, u'id': u'1', u'salaries': [{u'salary': u'1000'}, {u'salary': u'5000'}], u'name': u'a'}
['states.cities.city', 'states.cities', '.id', 'states.state', 'salaries.salary', '.salaries', '.states', '.name']
Note:
My Python version: 2.7
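One thing worth noticing in the output above: only Tamil Nadu appears under states. The sample repeats the "state" and "cities" keys inside the same object, and Python's json module keeps the last value it sees for a repeated key. If you need to know when that happens, here is a minimal sketch using the object_pairs_hook parameter of json.loads. The helper name find_duplicate_keys and the shortened sample string are my own, purely for illustration.

```python
import json

def find_duplicate_keys(text):
    """Return the keys that occur more than once inside a single JSON object."""
    duplicates = []

    def hook(pairs):
        # pairs is the list of (key, value) tuples of one JSON object, in file order.
        seen = {}
        for key, value in pairs:
            if key in seen:
                duplicates.append(key)
            seen[key] = value  # same "last value wins" rule as the default parser
        return seen

    json.loads(text, object_pairs_hook=hook)
    return duplicates

# Shortened version of the sample document (illustrative only).
sample = ('{"states": {"state": "Karnataka", "cities": [{"city": "Bangalore"}], '
          '"state": "Tamil Nadu", "cities": [{"city": "Chennai"}]}}')

print(json.loads(sample)["states"]["state"])  # Tamil Nadu
print(find_duplicate_keys(sample))            # ['state', 'cities']
```

If the states are really meant to be separate entries, the document would probably need "states" to be an array of objects rather than one object with repeated keys.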
You can also do it this way.
data = { "id":"1", "name":"a", "salaries":[ { "salary":"1000" }, { "salary":"5000" } ], "states":{ "state":"Karnataka", "cities":[ { "city":"Bangalore" }, { "city":"Mysore" } ], "state":"Tamil Nadu", "cities":[ { "city":"Chennai" }, { "city":"Coimbatore" } ] } }
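def dict_ittr(lin, data):
    for k, v in data.items():
        if type(v) is list:
            # Visit every element of the array (each is assumed to be a dict).
            for l in v:
                dict_ittr(lin + "." + k, l)
        elif type(v) is dict:
            dict_ittr(lin + "." + k, v)
        else:
            # Leaf value: print the dotted path that leads to it.
            print(lin + "." + k)

dict_ittr("", data)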
Output:
.states.state
.states.cities.city
.states.cities.city
.id
.salaries.salary
.salaries.salary
.name
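The output above repeats a path once per array element and starts every path with a dot. If you want each path once, in the order of the expected output, something along these lines might do. This is only a sketch, assuming Python 3.7+ (where dicts keep insertion order); leaf_paths and unique_leaf_paths are names I made up for it.

```python
def leaf_paths(node, prefix=""):
    """Yield the dotted path of every non-container value under node."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = key if prefix == "" else prefix + "." + key
            yield from leaf_paths(value, child)
    elif isinstance(node, list):
        # Array elements share the path of the array itself.
        for item in node:
            yield from leaf_paths(item, prefix)
    else:
        yield prefix

def unique_leaf_paths(data):
    """Return the leaf paths in first-seen order, without repeats."""
    seen = []
    for path in leaf_paths(data):
        if path not in seen:
            seen.append(path)
    return seen

data = {
    "id": "1",
    "name": "a",
    "salaries": [{"salary": "1000"}, {"salary": "5000"}],
    "states": {
        "state": "Tamil Nadu",
        "cities": [{"city": "Chennai"}, {"city": "Coimbatore"}],
    },
}

print(" ".join(unique_leaf_paths(data)))
# id name salaries.salary states.state states.cities.city
```

Because the list branch does not assume its elements are dicts, an array of plain strings or numbers is reported under the array's own name instead of raising an error.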
If you treat the JSON like a Python dictionary, this should work. I just wrote a simple recursive program.
Script
import json

def js_r(filename):
    # Load the JSON file into a Python dictionary.
    with open(filename) as f_in:
        return json.load(f_in)

g = js_r("city.json")
answer_d = {}

def base_line(g, answer_d):
    # Start with one empty dictionary per top-level key.
    for key in g.keys():
        answer_d[key] = {}
    return answer_d

answer_d = base_line(g, answer_d)

def recurser_func(g, answer_d):
    for k in g.keys():
        if type(g[k]) == type([]):  # If the value is a list
            # Only the first key of the first element is inspected.
            answer_d[k] = {list(g[k][0].keys())[0]: {}}
        if type(g[k]) == type({}):  # If the value is a dictionary
            answer_d[k] = {list(g[k].keys())[0]: {}}  # seed with the first child key
            answer_d[k] = recurser_func(g[k], answer_d[k])
    return answer_d

recurser_func(g, answer_d)

def printer_func(answer_d, list_to_print, parent):
    for k in answer_d.keys():
        if len(answer_d[k].keys()) == 1:
            list_to_print.append(parent)
            list_to_print[-1] += k
            list_to_print[-1] += "." + str(list(answer_d[k].keys())[0])
        if len(answer_d[k].keys()) == 0:
            list_to_print.append(parent)
            list_to_print[-1] += k
        if len(answer_d[k].keys()) > 1:
            # Carry the full parent path down, not just the current key.
            printer_func(answer_d[k], list_to_print, parent + k + ".")
    return list_to_print

l = printer_func(answer_d, [], "")
final = " ".join(l)
print(final)
Explanation
base_line makes a dictionary of all your base keys.
recurser_func checks whether each key's value is a list or a dict, then adds to the answer dictionary as necessary until answer_d looks like: {'id': {}, 'name': {}, 'salaries': {'salary': {}}, 'states': {'state': {}, 'cities': {'city': {}}}}
After these two functions are called you have, in a sense, a dictionary of keys. printer_func is then a recursive function that prints it as you desired.
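Since the question says the structure varies from document to document, it may matter that the list branch of the script only looks at the first element of each array. Here is a rough sketch of building the same kind of key dictionary while merging the keys of every array element. key_tree and merge are hypothetical helper names, and the "bonus" and "pin" fields are invented just to show elements with different keys.

```python
def merge(target, source):
    """Merge the nested key dictionary source into target, in place."""
    for key, subtree in source.items():
        merge(target.setdefault(key, {}), subtree)

def key_tree(node):
    """Return a nested dict of keys; an array contributes the union of its elements' keys."""
    tree = {}
    if isinstance(node, dict):
        for key, value in node.items():
            tree[key] = key_tree(value)
    elif isinstance(node, list):
        for item in node:
            merge(tree, key_tree(item))
    return tree  # scalars fall through and give an empty dict

# Invented document: "bonus" and "pin" only appear in the second array elements.
doc = {
    "id": "1",
    "salaries": [{"salary": "1000"}, {"bonus": "50"}],
    "states": {
        "state": "Karnataka",
        "cities": [{"city": "Bangalore"}, {"city": "Mysore", "pin": "570001"}],
    },
}

print(key_tree(doc))
# {'id': {}, 'salaries': {'salary': {}, 'bonus': {}},
#  'states': {'state': {}, 'cities': {'city': {}, 'pin': {}}}}
```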
NOTE:
Your question is similar to this one: Get all keys of a nested dictionary. Since you have nested lists and dictionaries rather than just a nested dictionary, the answers there won't work for you as they are, but there is more discussion of the topic on that question if you would like more info.
EDIT 1
My Python version is 3.7.1.
I have added a JSON file opener to the top. I assume that the JSON file is named city.json and is in the same directory.
EDIT 2: More thorough explanation
The main difficulty I found in dealing with your data is that lists and dictionaries can be nested to any depth. That makes it complicated. Since the nesting is unbounded, I knew this was a recursion problem.
So I build a dictionary of dictionaries representing the key structure that you are looking for. First, I start with the baseline.
base_line
base_line makes {'id': {}, 'name': {}, 'salaries': {}, 'states': {}}
This is a dictionary of empty dictionaries. I know that when you print, every key structure (like states.state) starts with one of these words.
recursion
Then I add all the child keys using recurser_func. Given a dictionary g, this function loops through all the keys in that dictionary and (assuming answer_d has each key that g has) adds each key's child to answer_d.
If the child is a dictionary, I recurse, with g now being the sub-part of the dictionary that pertains to the child, and answer_d being the sub-part of answer_d that pertains to the child.
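Since the question is about PySpark, the same recursion can probably also be run over the schema that Spark infers, instead of over the raw dictionary. What follows is only a sketch under assumptions: it walks a plain dict shaped like the result of StructType.jsonValue(), the schema below is written by hand to mirror the sample (it is not real Spark output), and schema_paths is a name I made up.

```python
def schema_paths(dtype, prefix=""):
    """Collect dotted leaf names from a dict shaped like StructType.jsonValue()."""
    if isinstance(dtype, dict) and dtype.get("type") == "struct":
        paths = []
        for field in dtype["fields"]:
            name = field["name"] if prefix == "" else prefix + "." + field["name"]
            paths.extend(schema_paths(field["type"], name))
        return paths
    if isinstance(dtype, dict) and dtype.get("type") == "array":
        # Array elements share the path of the array itself.
        return schema_paths(dtype["elementType"], prefix)
    return [prefix]  # simple types such as "string" are leaves

def field(name, dtype):
    # Hypothetical helper that builds one field entry for the hand-written schema.
    return {"name": name, "type": dtype, "nullable": True, "metadata": {}}

def array_of(dtype):
    return {"type": "array", "elementType": dtype, "containsNull": True}

def struct(*fields):
    return {"type": "struct", "fields": list(fields)}

# Hand-written stand-in for the schema of the sample. With Spark it would come
# from something like spark.read.json(path, multiLine=True).schema.jsonValue().
schema = struct(
    field("id", "string"),
    field("name", "string"),
    field("salaries", array_of(struct(field("salary", "string")))),
    field("states", struct(
        field("state", "string"),
        field("cities", array_of(struct(field("city", "string")))),
    )),
)

print(" ".join(schema_paths(schema)))
# id name salaries.salary states.state states.cities.city
```

Working from the schema means Spark has already merged the structures of all the documents it read, so keys that appear in only some documents should still show up.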