How to extract values from a nested JSON array using pandas

Question

I have a large JSON file (400k lines). I am trying to isolate the following:

Policies - "description"

Policy items - "users" and "database values"

JSON file: https://pastebin.com/hv8mLfgx

Expected output from pandas: https://imgur.com/a/FVcNGsZ

Everything after "Policy Items" is repeated in exactly the same form throughout the file. I have tried the code below to isolate "users". It doesn't seem to work. I'm trying to dump all of this into a CSV.

Edit: here is a solution I attempted, but I could not get it to work correctly: "Deeply nested JSON response to pandas dataframe".

from pandas.io.json import json_normalize as Jnormal
import json
import pprint, csv
import re

with open("Ranger_Policies_20190204_195010.json") as file:
    jsonDF = json.load(file)
    for item in jsonDF['policies'][0]['policyItems'][0]:
        print ('{} - {} - {}'.format(jsonDF['users']))

EDIT 2: I have some working code that grabs some of the USERS, but not all of them: only 11 out of 25.

from pandas.io.json import json_normalize as Jnormal
import json
import pprint, csv
import re

with open("Ranger_Policies_20190204_195010.json") as file:
    jsonDF = json.load(file)
    pNode = Jnormal(jsonDF['policies'][0]['policyItems'], record_path='users')
    print(pNode.head(500))

EDIT 3: This is the final working copy; however, I am still not copying over all my TABLE data. I set up the loop to simply capture everything, and I'd sort it in Excel. Does anyone have any idea why I cannot capture all the TABLE values?

with open("Ranger_Policies_20190204_195010.json") as file:
    json_data = json.load(file)
    with open("test.csv", 'w', newline='') as fd:
        wr = csv.writer(fd)
        wr.writerow(('Database name', 'Users', 'Description', 'Table'))
        for policy in json_data['policies']:
            desc = policy['description']
            db_values = policy['resources']['database']['values']
            db_tables = policy['resources']['table']['values']
            for item in policy['policyItems']:
                users = item['users']
                for dbT in db_tables:
                    for user in users:
                        for db in db_values:
                            _ = wr.writerow((db, user, desc, dbT))

Answer 1

Pandas is overkill here: the csv standard module is enough. You just have to iterate over policies to extract the description and database values, then over policyItems to extract the users:

import csv
import json

with open("Ranger_Policies_20190204_195010.json") as file:
    jsonDF = json.load(file)
# Open the output in write mode; newline='' lets csv control line endings.
with open("outputfile.csv", "w", newline='') as fd:
    wr = csv.writer(fd)
    _ = wr.writerow(('Database name', 'Users', 'Description'))
    for policy in jsonDF['policies']:
        desc = policy['description']
        db_values = policy['resources']['database']['values']
        for item in policy['policyItems']:
            users = item['users']
            # One row per (database, user) pair, skipping the '*' wildcard.
            for user in users:
                for db in db_values:
                    if db != '*':
                        _ = wr.writerow((db, user, desc))
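
The same loop can be extended to the Table column from the question's EDIT 3. Below is a minimal sketch, not a tested solution for the real file: the `sample` dictionary is hypothetical data that only mirrors the keys used in the code above, the `policy_rows` helper is a name introduced here for illustration, and the idea that some policies may lack a `table` resource is an assumption. It uses `.get` with defaults so that a missing key produces an empty value instead of a `KeyError`.

```python
import csv
import io

# Hypothetical sample mirroring the keys used in the answer's code;
# the real Ranger export is not reproduced here.
sample = {
    "policies": [
        {
            "description": "Policy A",
            "resources": {
                "database": {"values": ["db1"]},
                "table": {"values": ["t1", "t2"]},
            },
            "policyItems": [{"users": ["alice"]}, {"users": ["bob"]}],
        },
        {
            # Assumed case: a policy without a "table" resource.
            "description": "Policy B",
            "resources": {"database": {"values": ["db2"]}},
            "policyItems": [{"users": ["carol"]}],
        },
    ]
}


def policy_rows(data):
    """Yield (database, user, description, table) for every policy item."""
    for policy in data.get("policies", []):
        desc = policy.get("description", "")
        resources = policy.get("resources", {})
        dbs = resources.get("database", {}).get("values", [])
        # Fall back to one empty table name so the row is still emitted.
        tables = resources.get("table", {}).get("values", []) or [""]
        for item in policy.get("policyItems", []):
            for user in item.get("users", []):
                for db in dbs:
                    for table in tables:
                        yield (db, user, desc, table)


rows = list(policy_rows(sample))

# Write to an in-memory buffer here; for a real file, use
# open("outputfile.csv", "w", newline="") instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(("Database name", "Users", "Description", "Table"))
writer.writerows(rows)
print(buffer.getvalue())
```

Keeping the row generation in a generator separate from the writing step makes it easy to count or inspect the rows before anything is written to disk.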

Answer 2

Here is one way to do it, assuming your JSON data is in a variable called json_data:

from itertools import product

import pandas as pd

def make_dfs(data):
    cols = ['db_name', 'user', 'description']

    for item in data.get('policies'):
        description = item.get('description')
        users = item.get('policyItems', [{}])[0].get('users', [None])
        db_name = item.get('resources', {}).get('database', {}).get('values', [None])
        db_name = [name for name in db_name if name != '*']
        prods = product(db_name, users, [description])
        yield pd.DataFrame.from_records(prods, columns=cols)

df = pd.concat(make_dfs(json_data), ignore_index=True)

print(df)

   db_name          user                               description
0    m2_db          hive  Policy for all - database, table, column
1    m2_db  rangerlookup  Policy for all - database, table, column
2    m2_db     ambari-qa  Policy for all - database, table, column
3    m2_db          af34  Policy for all - database, table, column
4    m2_db          g748  Policy for all - database, table, column
5    m2_db          hdfs  Policy for all - database, table, column
6    m2_db          dh10  Policy for all - database, table, column
7    m2_db          gs22  Policy for all - database, table, column
8    m2_db          dh27  Policy for all - database, table, column
9    m2_db          ct52  Policy for all - database, table, column
10   m2_db  livy_pyspark  Policy for all - database, table, column

Tested on Python 3.5.1 and pandas==0.23.4
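
Note that `make_dfs` indexes `policyItems` with `[0]`, so only the users of the first policy item in each policy are included. If a policy has several policy items, a variant that flattens the users of all of them might look like the sketch below. The `sample` data and the `iter_records` name are hypothetical and only mirror the keys used above; the sketch uses only the standard library so it can be checked without the original file.

```python
from itertools import product

# Hypothetical data with two policy items in a single policy.
sample = {
    "policies": [
        {
            "description": "Policy A",
            "resources": {"database": {"values": ["db1", "*"]}},
            "policyItems": [
                {"users": ["alice", "bob"]},
                {"users": ["carol"]},
            ],
        }
    ]
}


def iter_records(data):
    """Yield (db_name, user, description) across all policy items."""
    for policy in data.get("policies", []):
        description = policy.get("description")
        db_values = policy.get("resources", {}).get("database", {}).get("values", [])
        # Drop the '*' wildcard, as the answer's code does.
        db_names = [name for name in db_values if name != "*"]
        # Collect users from every policy item, not just the first one.
        users = [
            user
            for item in policy.get("policyItems", [])
            for user in item.get("users", [])
        ]
        yield from product(db_names, users, [description])


records = list(iter_records(sample))
for record in records:
    print(record)
```

The resulting tuples can be passed to `pd.DataFrame.from_records(records, columns=['db_name', 'user', 'description'])` to obtain a DataFrame with the same columns as above.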

Answer 1: 2, 2019-02-12 19:12:24
Answer 2: 1, 2019-02-12 18:46:13
