
Python nested JSON to CSV/XLSX with specified headers

Given JSON like the one below, which is an array of objects at the outermost level, with further nested arrays of objects:

data = [{"a": [{"a1": [{"id0": [{"aa": [{"aaa": 97}, {"aab": "one"}], "ab": [{"aba": 97}, {"abb": ["one", "two"]}]}]}, {"id1": [{"aa": [{"aaa": 23}]}]}]}, {"a2": []}]}, {"b": [{"b1": [{"Common": [{"bb": [{"value": 4}]}]}]}]}]

I need to write this to a CSV (or .xlsx) file.

What have I tried so far?

data_file = open('data_file.csv', 'w')
csv_writer = csv.writer(data_file)
for row in data:
  csv_writer.writerow(row)
data_file.close() 

This gives an empty file 'data_file.csv'.

Also, how do I add headers to the CSV? I have the headers stored in a list, as below:

hdrs = ['Section', 'Subsection', 'pId', 'Group', 'Parameter', 'Value'] 

- this corresponds to the five levels of keys, plus a final column for the value

Expected CSV output

+---------+------------+--------+-------+-----------+----------+
| Section | Subsection |  pId   | Group | Parameter |  Value   |
+---------+------------+--------+-------+-----------+----------+
| a       | a1         | id0    | aa    | aaa       | 97       |
| a       | a1         | id0    | aa    | aab       | one      |
| a       | a1         | id0    | ab    | aba       | 97       |
| a       | a1         | id0    | ab    | abb       | one, two |
| a       | a1         | id1    | aa    | aaa       | 23       |
| a       | a2         |        |       |           |          |
| b       | b1         | Common | bb    | value     | 4        |
+---------+------------+--------+-------+-----------+----------+

Expected XLSX output (image not available)

The following code parses the provided data into the expected format.

from typing import List

def parse_recursive(dat) -> List[List]:
    """Flatten nested lists of dicts into rows of [key, key, ..., value]."""
    ret = []
    if isinstance(dat, list):
        for item in dat:
            if isinstance(item, dict):
                for k in item:
                    if item[k] == []:  # empty list: emit the key on its own
                        ret.append([k])
                    else:
                        for l in parse_recursive(item[k]):
                            ret.append([k] + l)  # always a list of lists
            else:  # a list of scalars, e.g. ["one", "two"]
                # str() keeps the join from failing on non-string items
                return [[",".join(str(x) for x in dat)]]
    else:  # a scalar, e.g. 97 or "one"
        return [[dat]]

    return ret


def write_to_csv(file_name: str, fields: List, row_data: List[List]):
    import csv
    # newline='' stops the csv module from emitting blank lines on Windows
    with open(file_name, 'w', newline='') as csvfile:
        csvwriter = csv.writer(csvfile)
        csvwriter.writerow(fields)      # header row
        csvwriter.writerows(row_data)   # data rows


if __name__=="__main__":
    org_data = [{"a": [
        {"a1": [
            {"id0": [
                {
                    "aa": [
                        {"aaa": 97},
                        {"aab": "one"}],
                    "ab": [
                        {"aba": 97},
                        {"abb": ["one", "two"]}
                        ]
                }
            ]
            },
            {"id1": [
                {"aa": [
                    {"aaa": 23}]}]}
            ]
        },
        {"a2": []}
        ]},
        {"b": [{"b1": [{"Common": [{"bb": [{"value": 4}]}]}]}]}]
    print(parse_recursive(org_data)) #Debug

    file_name="data_file.csv"
    fields=['Section', 'Subsection', 'pId', 'Group', 'Parameter', 'Value']
    write_to_csv(file_name, fields, parse_recursive(org_data))

parse_recursive tries to parse a structure of arbitrary depth, following the rules I deduced from your input and output formats.

The following is the output of parse_recursive for your provided input:

mahorir@mahorir-Vostro-3446:~/Desktop$ python3 so.py 
[['a', 'a1', 'id0', 'aa', 'aaa', 97], ['a', 'a1', 'id0', 'aa', 'aab', 'one'], ['a', 'a1', 'id0', 'ab', 'aba', 97], ['a', 'a1', 'id0', 'ab', 'abb', 'one,two'], ['a', 'a1', 'id1', 'aa', 'aaa', 23], ['a', 'a2'], ['b', 'b1', 'Common', 'bb', 'value', 4]]

write_to_csv is a trivial function that writes the header and the rows to a CSV file.
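In the output above, the 'a2' row has only two entries, so it is written as a shorter line than the others. If you would rather have every row carry the same number of columns, a minimal sketch follows. Note that pad_rows is a hypothetical helper name, not part of the code above, and the empty string is only an assumed fill value.

```python
from typing import Any, List, Sequence


def pad_rows(rows: Sequence[Sequence[Any]], width: int, fill: Any = "") -> List[List[Any]]:
    """Return copies of the rows, right-padded with `fill` up to `width` columns.

    Rows that are already `width` or longer are copied unchanged.
    """
    return [list(row) + [fill] * (width - len(row)) for row in rows]


fields = ['Section', 'Subsection', 'pId', 'Group', 'Parameter', 'Value']
rows = [['a', 'a1', 'id1', 'aa', 'aaa', 23], ['a', 'a2']]

padded = pad_rows(rows, len(fields))
print(padded[1])  # ['a', 'a2', '', '', '', '']
```

You could then pass pad_rows(parse_recursive(org_data), len(fields)) to write_to_csv instead of the raw rows.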

This was kind of a fun problem... There are really two problems with the formatting here:

  1. The data is lists of dicts, where they really just wanted dictionaries. E.g. they wanted {"foo": 1, "bar": 2} but instead formatted it as [{"foo": 1}, {"bar": 2}].

    a. I'm not judging here. There may be reasons why they did this. It just makes it a bit annoying for us to parse.

  2. The data is sometimes truncated; it is usually 5 levels deep, but when there is no data beyond a certain point, the remaining levels are simply omitted. E.g. 'a2' in your example.

So I'll show two possible approaches to solving these problems.

The Pandas Way

This solution is a bit different from the other one mentioned here. Let me know what you think:

import pandas as pd
from copy import deepcopy

hdrs = ['Section', 'Subsection', 'pId', 'Group', 'Parameter', 'Value']

js = [{"a": [{"a1": [{"id0": [{"aa": [{"aaa": 97}, {"aab": "one"}],
                               "ab": [{"aba": 98}, {"abb": ["one", "two"]}]}]},
                     {"id1": [{"aa": [{"aaa": 23}]}]}
                    ]},
             {"a2": []}
            ]},
      {"b": [{"b1": [{"Common": [{"bb": [{"value": 4}]}]}]}]}]

def list_to_dict(lst):
    """convert a list of dicts as you have to a single dict

    The idea here is that you have a bunch of structures that look
    like [{x: ...}, {y: ...}] that should probably have been stored as
    {x:..., y:...}. So this function does that (but just one level in).
    
    Note:
    If there is a duplicate key in one of your dicts (meaning you have
    something like [{x:...},...,{x:...}]), then this function will overwrite
    it without warning!
    """
    d = {}
    for new_d in lst:
        d.update(new_d)
    return d

def recursive_parse(lst, levels):
    "Parse the nested json into a single pandas dataframe"
    name = levels.pop(0)  # I should have used a counter instead
    d = list_to_dict(lst)  # get a sensible dict instead of the list of dicts
    if len(levels) <= 1: # meaning there are no more levels to be parsed.
        if len(d) == 0:
            d = {'': ''} # to handle the uneven depths (e.g. think 'a2')
        return pd.Series(d, name=levels[-1])
    if len(d) == 0: # again to handle the uneven depths of json
        d = {'': []}
    # below is a dict comprehension that recursively parses each value.
    d = {k: recursive_parse(v, deepcopy(levels)) for k, v in d.items()}
    return pd.concat(d)

def json_to_df(js, headers):
    "calls recursive_parse, and then adds the column names and whatnot"
    df = recursive_parse(js, deepcopy(headers))
    df.index.names = headers[:-1]
    df = df.reset_index()
    return df
df = json_to_df(js, hdrs)
print(df)  # in a Jupyter notebook, display(df) renders a nicer table

And the output is exactly the DataFrame you want (but with an index column you may not want). If you write it to CSV afterwards, do so like this:

df.to_csv('path/to/desired/file.csv', index=False)
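One detail to watch: as far as I can tell, the 'abb' value stays a Python list in the 'Value' column, so to_csv would write its repr rather than the "one, two" shown in the expected output. A minimal sketch of one way to handle that is below. Note that join_if_list is a hypothetical helper name, and the ", " separator is an assumption taken from the expected table.

```python
def join_if_list(value, sep=", "):
    """Join list values into one string; return anything else unchanged."""
    if isinstance(value, list):
        return sep.join(str(x) for x in value)
    return value


print(join_if_list(["one", "two"]))  # one, two
print(join_if_list(98))              # 98
```

With the DataFrame above, something like df['Value'] = df['Value'].map(join_if_list) before calling to_csv should apply it to every cell.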

Does that make sense?

The minimalist way

Better version (not using pandas)...

import csv

hdrs = ['Section', 'Subsection', 'pId', 'Group', 'Parameter', 'Value']

js = [{"a": [{"a1": [{"id0": [{"aa": [{"aaa": 97}, {"aab": "one"}],
                               "ab": [{"aba": 98}, {"abb": ["one", "two"]}]}]},
                     {"id1": [{"aa": [{"aaa": 23}]}]}
                    ]},
             {"a2": []}
            ]},
      {"b": [{"b1": [{"Common": [{"bb": [{"value": 4}]}]}]}]}]

def list_of_dicts_to_lists(lst, n_levels=len(hdrs)):
    if n_levels == 1:
        if isinstance(lst, list):
            if len(lst) == 0: # truncated branches arrive here as an empty list
                lst = None # write them as an empty cell
            else: # [1, 2] => "1,2"
                lst = ','.join(str(x) for x in lst if x is not None)
        return [[lst]] # every level returns a list of rows, so start with one single-cell row
    if len(lst) == 0:
        lst = [{None: []}] # filling with an empty list
    output = []
    for d in lst:
        for k, v in d.items():
            tmp = list_of_dicts_to_lists(v, n_levels - 1)
            for x in tmp:
                output.append([k] + x)
    return output

def to_csv(values, header, outfile):
    with open(outfile, 'w', newline='') as csv_file:
        # pretty much straight from the docs @
        # https://docs.python.org/3.7/library/csv.html
        csv_writer = csv.writer(csv_file, quoting=csv.QUOTE_MINIMAL)
        csv_writer.writerow(header)
        for line in values:
            csv_writer.writerow(line)
    return True

rows = list_of_dicts_to_lists(js)
to_csv(rows, hdrs, 'tmp.csv')

I see now that this solution is super similar to the other answer here... My bad.
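None of the snippets above writes an .xlsx file, which the question also asks for. Here is a minimal sketch, assuming the third-party openpyxl package is installed. Note that write_to_xlsx is a hypothetical helper name, not something from the answers above. Every cell must be a scalar or a string, so list values need to be joined first (the flattening functions above already do that).

```python
from typing import Any, List


def write_to_xlsx(file_name: str, fields: List[str], row_data: List[List[Any]]) -> None:
    """Write a header row followed by the data rows to the first worksheet."""
    # Assumption: openpyxl is available (e.g. installed with `pip install openpyxl`).
    from openpyxl import Workbook

    wb = Workbook()
    ws = wb.active           # the default worksheet created with the workbook
    ws.append(fields)        # header row
    for row in row_data:
        ws.append(row)       # one spreadsheet row per list; None becomes an empty cell
    wb.save(file_name)
```

With the names from the minimalist version, the call would be write_to_xlsx('tmp.xlsx', hdrs, rows).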
