
Convert Nested JSON to CSV in Python

I am trying to convert a complex JSON file (in nested format) to CSV.

{
"caudal": [
{"ts": 1612746051248, "value": "0.0"}, 
{"ts": 1612745450856, "value": "0.0"}, 
{"ts": 1612744250898, "value": "0.0"}, 
{"ts": 1612743650861, "value": "0.0"}, 
{"ts": 1612743050821, "value": "0.0"} 
], 
"FreeHeap": [
{"ts": 1612746051248, "value": "247564"}, 
{"ts": 1612745450856, "value": "247564"}, 
{"ts": 1612744250898, "value": "247564"}, 
{"ts": 1612743650861, "value": "247564"}, 
{"ts": 1612743050821, "value": "247564"} 
], 
"MinimoFreeHeap": [
{"ts": 1612746051248, "value": "237440"}, 
{"ts": 1612745450856, "value": "237440"}, 
{"ts": 1612744250898, "value": "237440"}, 
{"ts": 1612743650861, "value": "237440"}, 
{"ts": 1612743050821, "value": "237440"} 
]
} 

The JSON files that my program must process contain many more records, but I made this one smaller to simplify the analysis. I have tried using the pandas library as follows:

import pandas as pd

with open('read.json') as f_input:
    df = pd.read_json(f_input)

df.to_csv('out.csv', encoding='utf-8', index=False)

And I get the following result:

caudal,FreeHeap,MinimoFreeHeap
"{'ts': 1612746051248, 'value': '0.0'}","{'ts': 1612746051248, 'value': '247564'}","{'ts': 1612746051248, 'value': '237440'}"
"{'ts': 1612745450856, 'value': '0.0'}","{'ts': 1612745450856, 'value': '247564'}","{'ts': 1612745450856, 'value': '237440'}"
"{'ts': 1612744250898, 'value': '0.0'}","{'ts': 1612744250898, 'value': '247564'}","{'ts': 1612744250898, 'value': '237440'}"
"{'ts': 1612743650861, 'value': '0.0'}","{'ts': 1612743650861, 'value': '247564'}","{'ts': 1612743650861, 'value': '237440'}"
"{'ts': 1612743050821, 'value': '0.0'}","{'ts': 1612743050821, 'value': '247564'}","{'ts': 1612743050821, 'value': '237440'}"

As you can see, the information in each cell looks like this:

"{'ts': 1612743050821, 'value': '247564'}"

As I understand it, this is another JSON object. Is there any simple way to add a column named timestamp (ts) and put only the values in the cells where this JSON is now? I believe this would be the correct way; my goal is to transform the information contained in the JSON into CSV format to make it more accessible to third parties (databases or artificial intelligence algorithms). But if you can think of another way or a format that is more convenient, I am open to changing my initial idea. I have to admit that I am new to this world.

I thought about going through the JSON and doing the conversion manually, but it becomes difficult to relate the measurements that share the same timestamp.

Nicolás

You don't say exactly how you want the data, so the code below converts it into a tabular format with one column each for machine (not sure if that's the right name), ts and value.

import pandas as pd
import json

with open('read.json') as f_input:
    data = json.load(f_input)

df = pd.DataFrame.from_dict(data, orient='columns')

# build one row per (machine, ts, value) triple
rows = []
for col in df.columns:
    for index, row in df[col].items():  # each entry is a dict like {'ts': ..., 'value': ...}
        rows.append({'machine': col, 'ts': row['ts'], 'value': row['value']})

df_new = pd.DataFrame(rows, columns=['machine', 'ts', 'value'])

df_new.to_csv('out.csv', encoding='utf-8', index=False)

If you want the columns to be the machines with the timestamp as the index, change the last two lines to this:

df_new = pd.DataFrame(rows, columns=['machine', 'ts', 'value']).pivot(index='ts', columns='machine', values='value')

df_new.to_csv('out.csv', encoding='utf-8')
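
Either way, if readable dates are preferred over the raw epoch milliseconds, the ts values can be converted with pd.to_datetime before writing the CSV (a small optional sketch; df_new refers to the dataframe built above):

# optional: convert the epoch-millisecond timestamps to readable datetimes
# in the long format this is the 'ts' column; after the pivot it is the index
df_new['ts'] = pd.to_datetime(df_new['ts'], unit='ms')       # long format
# df_new.index = pd.to_datetime(df_new.index, unit='ms')     # wide (pivoted) format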
  • As per the timing analysis for this question, pd.DataFrame(df[col].values.tolist()) is the fastest way to normalize a single-level dict from a column, but this answer shows how to deal with columns that are problematic (e.g. ones that result in errors when trying .values.tolist()).
import pandas as pd

# read the json file
with open('read.json') as f_input:
    df = pd.read_json(f_input)

# collect the normalized columns from df
frames = []

# iterate through each column, normalize it, and collect the result
for col in df.columns:
    normed = pd.DataFrame(df[col].values.tolist())  # normalize the column from df
    normed['type'] = col  # add the original column name as a new column so the associated values can be identified
    frames.append(normed)

# combine the pieces into a single long-form dataframe
normed_df = pd.concat(frames)

# convert ts to a datetime dtype
normed_df.ts = pd.to_datetime(normed_df.ts, unit='ms')

# reset the index
normed_df = normed_df.reset_index(drop=True)

# save this long form to a csv
normed_df.to_csv('long.csv', index=False)

# display(normed_df)
                        ts   value            type
0  2021-02-08 01:00:51.248     0.0          caudal
1  2021-02-08 00:50:50.856     0.0          caudal
2  2021-02-08 00:30:50.898     0.0          caudal
3  2021-02-08 00:20:50.861     0.0          caudal
4  2021-02-08 00:10:50.821     0.0          caudal
5  2021-02-08 01:00:51.248  247564        FreeHeap
6  2021-02-08 00:50:50.856  247564        FreeHeap
7  2021-02-08 00:30:50.898  247564        FreeHeap
8  2021-02-08 00:20:50.861  247564        FreeHeap
9  2021-02-08 00:10:50.821  247564        FreeHeap
10 2021-02-08 01:00:51.248  237440  MinimoFreeHeap
11 2021-02-08 00:50:50.856  237440  MinimoFreeHeap
12 2021-02-08 00:30:50.898  237440  MinimoFreeHeap
13 2021-02-08 00:20:50.861  237440  MinimoFreeHeap
14 2021-02-08 00:10:50.821  237440  MinimoFreeHeap
  • Use .pivot to align the data with ts as the index.
# pivot normed_df to a wide format
dfp = normed_df.pivot(index='ts', columns='type', values='value')

# display(dfp)
type                    FreeHeap MinimoFreeHeap caudal
ts                                                    
2021-02-08 00:10:50.821   247564         237440    0.0
2021-02-08 00:20:50.861   247564         237440    0.0
2021-02-08 00:30:50.898   247564         237440    0.0
2021-02-08 00:50:50.856   247564         237440    0.0
2021-02-08 01:00:51.248   247564         237440    0.0

# save this wide form to a csv
dfp.reset_index().to_csv('wide.csv', index=False)
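
If a third party later needs the saved file with proper types again, it can be read back with the timestamp parsed (a small optional sketch, assuming the file name used above):

# read the wide csv back, parsing ts as a datetime column
dfp_back = pd.read_csv('wide.csv', parse_dates=['ts'])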

Finally, I found a solution... There is a really interesting library called "cherrypicker". With its examples and pandas dataframes, I figured out how to make it work. The code is the following:

import pandas as pd
from cherrypicker import CherryPicker
import json

keys = {'FreeHeap', 'MinimoFreeHeap', 'caudal'} #In the future there will be more keys

with open('read.json') as f_input:
    data = json.load(f_input)

picker = CherryPicker(data)
pos = 0
for colum in keys:
    flat = picker[colum].flatten().get()
    df = pd.DataFrame(flat)
    df.columns = ['TimeStamp', colum]  #Rename
    if pos == 0:
        fin = df
        print(fin)
        pos = 1
    else:
        del df['TimeStamp']            #Remove timestamp because it is repeated
        fin[colum] = df     
        print(fin)

fin.to_csv('out.csv', encoding='utf-8', index=False)
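
This version assumes that every key reports exactly the same timestamps in the same order (which is true for the sample above). If that ever changes, a safer variant (just a sketch) would be to merge each key's dataframe on the TimeStamp column instead of relying on row order:

from functools import reduce

frames = []
for colum in keys:
    flat = picker[colum].flatten().get()
    df = pd.DataFrame(flat)
    df.columns = ['TimeStamp', colum]
    frames.append(df)

# outer-merge on TimeStamp so rows line up even if some timestamps are missing for a key
fin = reduce(lambda left, right: pd.merge(left, right, on='TimeStamp', how='outer'), frames)
fin.to_csv('out.csv', encoding='utf-8', index=False)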

I hope it will be useful to someone in the future. I am not sure if it is the simplest way, but it works for me! Greetings
