How to extract values from nested JSON array using pandas
I have a large JSON file (400k lines). I am trying to isolate the following:
Policies - "description"
Policy items - "users" and "database values"
JSON file: https://pastebin.com/hv8mLfgx
Expected output from pandas: https://imgur.com/a/FVcNGsZ
Everything after "Policy Items" is repeated in exactly the same form throughout the file. I have tried the code below to isolate "users". It doesn't seem to work. I'm trying to dump all of this into a CSV.
Edit: here is a solution I attempted, but I could not get it to work correctly: Deeply nested JSON response to pandas dataframe
from pandas.io.json import json_normalize as Jnormal
import json
import pprint, csv
import re

with open("Ranger_Policies_20190204_195010.json") as file:
    jsonDF = json.load(file)

for item in jsonDF['policies'][0]['policyItems'][0]:
    print('{} - {} - {}'.format(jsonDF['users']))
EDIT 2: I have some working code that grabs some of the users, but not all of them: only 11 out of 25.
from pandas.io.json import json_normalize as Jnormal
import json
import pprint, csv
import re

with open("Ranger_Policies_20190204_195010.json") as file:
    jsonDF = json.load(file)

pNode = Jnormal(jsonDF['policies'][0]['policyItems'], record_path='users')
print(pNode.head(500))
EDIT 3: This is the final working copy, however I am still not copying over all my TABLE data. I set the loop to skip any filtering, capture everything, and I'd sort it in Excel. Does anyone have any idea why I cannot capture all the TABLE values?
import csv
import json

with open("Ranger_Policies_20190204_195010.json") as file:
    json_data = json.load(file)

with open("test.csv", 'w', newline='') as fd:
    wr = csv.writer(fd)
    wr.writerow(('Database name', 'Users', 'Description', 'Table'))
    for policy in json_data['policies']:
        desc = policy['description']
        db_values = policy['resources']['database']['values']
        db_tables = policy['resources']['table']['values']
        for item in policy['policyItems']:
            users = item['users']
            for dbT in db_tables:
                for user in users:
                    for db in db_values:
                        _ = wr.writerow((db, user, desc, dbT))
Pandas is overkill here: the standard csv module is enough. You just have to iterate over policies to extract the description and database values, then over policyItems to extract the users:
import csv
import json

with open("Ranger_Policies_20190204_195010.json") as file:
    jsonDF = json.load(file)

# Open the output in write mode; newline='' lets the csv module control line endings
with open("outputfile.csv", 'w', newline='') as fd:
    wr = csv.writer(fd)
    _ = wr.writerow(('Database name', 'Users', 'Description'))
    for policy in jsonDF['policies']:
        desc = policy['description']
        db_values = policy['resources']['database']['values']
        for item in policy['policyItems']:
            users = item['users']
            for user in users:
                for db in db_values:
                    # Skip the wildcard database entry
                    if db != '*':
                        _ = wr.writerow((db, user, desc))
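If the table values from the question's EDIT 3 are also wanted, the same loop can be extended with one more level. The following is only a sketch under assumptions: the `sample` dictionary is hypothetical and merely mimics the keys read by the code above, `policy_rows` is an illustrative helper name, and the use of `.get()` with defaults (so that a policy without a `table` resource still yields rows instead of raising `KeyError`) is a design choice, not something confirmed against the real file.

```python
import csv
import io

# Hypothetical sample shaped like the structure the code above reads;
# the real file may contain other keys.
sample = {
    "policies": [
        {
            "description": "Policy A",
            "resources": {
                "database": {"values": ["db1", "*"]},
                "table": {"values": ["t1", "t2"]},
            },
            "policyItems": [{"users": ["alice"]}, {"users": ["bob"]}],
        },
        {
            # No "table" resource: direct indexing would raise KeyError here.
            "description": "Policy B",
            "resources": {"database": {"values": ["db2"]}},
            "policyItems": [{"users": ["carol"]}],
        },
    ]
}


def policy_rows(data):
    """Yield one (database, user, description, table) tuple per combination."""
    for policy in data.get("policies", []):
        desc = policy.get("description", "")
        resources = policy.get("resources", {})
        databases = resources.get("database", {}).get("values", [])
        # Fall back to a single empty table so the policy still produces rows.
        tables = resources.get("table", {}).get("values", []) or [""]
        for item in policy.get("policyItems", []):
            for user in item.get("users", []):
                for db in databases:
                    if db == "*":  # skip the wildcard database entry
                        continue
                    for table in tables:
                        yield (db, user, desc, table)


rows = list(policy_rows(sample))

# StringIO stands in for open("outputfile.csv", "w", newline="")
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(("Database name", "Users", "Description", "Table"))
writer.writerows(rows)
print(buffer.getvalue())
```

Keeping the row generation in a function separate from the writing makes it easy to count or inspect the rows before anything is written to disk.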
Here is one way to do it. Let's assume your JSON data is in a variable called json_data:
from itertools import product

import pandas as pd

def make_dfs(data):
    cols = ['db_name', 'user', 'description']
    for item in data.get('policies'):
        description = item.get('description')
        # Note: only the first policyItems entry is read here
        users = item.get('policyItems', [{}])[0].get('users', [None])
        db_name = item.get('resources', {}).get('database', {}).get('values', [None])
        # Drop the wildcard database entry
        db_name = [name for name in db_name if name != '*']
        # One row per (database, user) pair, each carrying the policy description
        prods = product(db_name, users, [description])
        yield pd.DataFrame.from_records(prods, columns=cols)

df = pd.concat(make_dfs(json_data), ignore_index=True)
print(df)
   db_name          user                               description
0    m2_db          hive  Policy for all - database, table, column
1    m2_db  rangerlookup  Policy for all - database, table, column
2    m2_db     ambari-qa  Policy for all - database, table, column
3    m2_db          af34  Policy for all - database, table, column
4    m2_db          g748  Policy for all - database, table, column
5    m2_db          hdfs  Policy for all - database, table, column
6    m2_db          dh10  Policy for all - database, table, column
7    m2_db          gs22  Policy for all - database, table, column
8    m2_db          dh27  Policy for all - database, table, column
9    m2_db          ct52  Policy for all - database, table, column
10   m2_db  livy_pyspark  Policy for all - database, table, column
Tested on Python 3.5.1 and pandas==0.23.4
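One thing to be aware of: the function above indexes `policyItems` with `[0]`, so it reads users from the first entry only. If a policy has several `policyItems` entries, users listed in later entries would not appear. Below is a minimal sketch of a variant that walks every entry. The `sample` data and the name `iter_rows` are hypothetical, and dropping repeated user names within a policy is an assumption about the desired output, not something stated in the question.

```python
from itertools import product

# Hypothetical sample: one policy with two policyItems entries.
sample = {
    "policies": [
        {
            "description": "Policy for all - database, table, column",
            "resources": {"database": {"values": ["m2_db", "*"]}},
            "policyItems": [
                {"users": ["hive", "hdfs"]},
                {"users": ["hdfs", "livy_pyspark"]},
            ],
        }
    ]
}


def iter_rows(data):
    """Yield (db_name, user, description) for every policyItems entry."""
    for policy in data.get("policies", []):
        description = policy.get("description")
        values = policy.get("resources", {}).get("database", {}).get("values", [])
        db_names = [name for name in values if name != "*"]
        # Collect users across all entries, keeping first-seen order
        # and skipping names already collected for this policy.
        users = []
        for item in policy.get("policyItems", []):
            for user in item.get("users", []):
                if user not in users:
                    users.append(user)
        yield from product(db_names, users, [description])


rows = list(iter_rows(sample))
for row in rows:
    print(row)
```

The resulting tuples can be passed to `pd.DataFrame.from_records(rows, columns=['db_name', 'user', 'description'])` in the same way as in the function above.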