I am trying to convert a complex JSON file (in nested format) to CSV:
{
"caudal": [
{"ts": 1612746051248, "value": "0.0"},
{"ts": 1612745450856, "value": "0.0"},
{"ts": 1612744250898, "value": "0.0"},
{"ts": 1612743650861, "value": "0.0"},
{"ts": 1612743050821, "value": "0.0"}
],
"FreeHeap": [
{"ts": 1612746051248, "value": "247564"},
{"ts": 1612745450856, "value": "247564"},
{"ts": 1612744250898, "value": "247564"},
{"ts": 1612743650861, "value": "247564"},
{"ts": 1612743050821, "value": "247564"}
],
"MinimoFreeHeap": [
{"ts": 1612746051248, "value": "237440"},
{"ts": 1612745450856, "value": "237440"},
{"ts": 1612744250898, "value": "237440"},
{"ts": 1612743650861, "value": "237440"},
{"ts": 1612743050821, "value": "237440"}
]
}
The JSONs that my program must process contain many more records, but I made this one smaller to simplify the analysis. I have tried using the pandas library as follows:
import pandas as pd

with open('read.json') as f_input:
    df = pd.read_json(f_input)

df.to_csv('out.csv', encoding='utf-8', index=False)
And I get the following result:
caudal,FreeHeap,MinimoFreeHeap
"{'ts': 1612746051248, 'value': '0.0'}","{'ts': 1612746051248, 'value': '247564'}","{'ts': 1612746051248, 'value': '237440'}"
"{'ts': 1612745450856, 'value': '0.0'}","{'ts': 1612745450856, 'value': '247564'}","{'ts': 1612745450856, 'value': '237440'}"
"{'ts': 1612744250898, 'value': '0.0'}","{'ts': 1612744250898, 'value': '247564'}","{'ts': 1612744250898, 'value': '237440'}"
"{'ts': 1612743650861, 'value': '0.0'}","{'ts': 1612743650861, 'value': '247564'}","{'ts': 1612743650861, 'value': '237440'}"
"{'ts': 1612743050821, 'value': '0.0'}","{'ts': 1612743050821, 'value': '247564'}","{'ts': 1612743050821, 'value': '237440'}"
As you can see, the information in each cell is, for example:
"{'ts': 1612743050821, 'value': '247564'}"
which I understand is another JSON object. Is there any simple way to add a column named timestamp (ts) and put only the values in the cells where this JSON is now? I believe this would be the correct approach: my goal is to transform the information contained in the JSON into CSV format to make it more accessible to third parties (databases or artificial-intelligence algorithms). But if you can think of another way or format that is more convenient, I am open to changing my initial idea. I have to admit that I am new to this world.
I thought about going through the JSON and doing the conversion manually, but it becomes difficult to relate the measurements that have the same timestamp.
Nicolás
You don't say how you want the data, so the code below converts it into a tabular format with one column each for machine (not sure if that's the right name), ts, and value.
import pandas as pd
import json

# load the nested json
with open('read.json') as f_input:
    data = json.load(f_input)

df = pd.DataFrame.from_dict(data, orient='columns')

# build one row per (machine, ts, value) triple; a separate name
# so the loaded json in `data` is not shadowed
rows = []
for col in df.columns:
    for entry in df[col]:
        rows.append({'machine': col, 'ts': entry['ts'], 'value': entry['value']})

df_new = pd.DataFrame(rows, columns=['machine', 'ts', 'value'])
df_new.to_csv('out.csv', encoding='utf-8', index=False)
If you want the columns to be the timestamp and the machines, change the last two lines to this:
df_new = pd.DataFrame(rows).pivot(index='ts', columns='machine', values='value')
df_new.to_csv('out.csv', encoding='utf-8')
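For reference, with the five sample records above, the pivoted out.csv should look roughly like this (ts stays in epoch milliseconds here, since this version never converts it to a datetime):
ts,FreeHeap,MinimoFreeHeap,caudal
1612743050821,247564,237440,0.0
1612743650861,247564,237440,0.0
1612744250898,247564,237440,0.0
1612745450856,247564,237440,0.0
1612746051248,247564,237440,0.0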
pd.DataFrame(df[col].values.tolist()) is the fastest way to normalize a single-level dict from a column; columns that are problematic (e.g. ones that raise errors when trying .values.tolist()) need extra handling, and a fallback sketch for that case is shown after the pivot below.
import pandas as pd
# read the json file
with open('read.json') as f_input:
    df = pd.read_json(f_input)
# collect the normalized columns from df
frames = []

# iterate through each column, normalize it, and collect the result
for col in df.columns:
    normed = pd.DataFrame(df[col].values.tolist())  # normalize the column from df
    normed['type'] = col  # add the original column name so the associated values can be identified
    frames.append(normed)

# concatenate the pieces with a fresh index
# (DataFrame.append was removed in pandas 2.0, so use pd.concat)
normed_df = pd.concat(frames, ignore_index=True)

# convert ts to a datetime dtype
normed_df.ts = pd.to_datetime(normed_df.ts, unit='ms')

# save this long form to a csv
normed_df.to_csv('long.csv', index=False)

# display(normed_df)
ts value type
0 2021-02-08 01:00:51.248 0.0 caudal
1 2021-02-08 00:50:50.856 0.0 caudal
2 2021-02-08 00:30:50.898 0.0 caudal
3 2021-02-08 00:20:50.861 0.0 caudal
4 2021-02-08 00:10:50.821 0.0 caudal
5 2021-02-08 01:00:51.248 247564 FreeHeap
6 2021-02-08 00:50:50.856 247564 FreeHeap
7 2021-02-08 00:30:50.898 247564 FreeHeap
8 2021-02-08 00:20:50.861 247564 FreeHeap
9 2021-02-08 00:10:50.821 247564 FreeHeap
10 2021-02-08 01:00:51.248 237440 MinimoFreeHeap
11 2021-02-08 00:50:50.856 237440 MinimoFreeHeap
12 2021-02-08 00:30:50.898 237440 MinimoFreeHeap
13 2021-02-08 00:20:50.861 237440 MinimoFreeHeap
14 2021-02-08 00:10:50.821 237440 MinimoFreeHeap
Then use .pivot to align the data with ts as the index.
# pivot normed_df to a wide format
dfp = normed_df.pivot(index='ts', columns='type', values='value')
# display(dfp)
type FreeHeap MinimoFreeHeap caudal
ts
2021-02-08 00:10:50.821 247564 237440 0.0
2021-02-08 00:20:50.861 247564 237440 0.0
2021-02-08 00:30:50.898 247564 237440 0.0
2021-02-08 00:50:50.856 247564 237440 0.0
2021-02-08 01:00:51.248 247564 237440 0.0
# save this wide form to a csv
dfp.reset_index().to_csv('wide.csv', index=False)
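This column-by-column normalization assumes every cell holds a dict. If a column is problematic and .values.tolist() raises (for example, under the assumption that some readings are missing and the cell is NaN), a slower but more forgiving sketch is to drop the empty cells and let pd.json_normalize expand the rest inside the loop:
# hypothetical fallback for a column with missing (NaN) entries
normed = pd.json_normalize(df[col].dropna().tolist())
normed['type'] = col
frames.append(normed)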
Finally I found the solution... There is a really interesting library called "cherrypicker". With its examples and pandas DataFrames I figured out how to make it work. The code is the following:
import pandas as pd
from cherrypicker import CherryPicker
import json

# a list (not a set) keeps the column order deterministic
keys = ['FreeHeap', 'MinimoFreeHeap', 'caudal']  # in the future there will be more keys

with open('read.json') as f_input:
    data = json.load(f_input)

picker = CherryPicker(data)
pos = 0
for column in keys:
    flat = picker[column].flatten().get()
    df = pd.DataFrame(flat)
    df.columns = ['TimeStamp', column]  # rename
    if pos == 0:
        fin = df
        pos = 1
    else:
        del df['TimeStamp']  # remove the timestamp because it is repeated
        fin[column] = df[column]
    print(fin)

fin.to_csv('out.csv', encoding='utf-8', index=False)
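For comparison, here is a rough dependency-free sketch of the same idea using only pandas (no cherrypicker; the column names are taken from data.keys(), so no hard-coded list is needed): build one small frame per key, then merge them on the shared timestamp.
import json
from functools import reduce

import pandas as pd

with open('read.json') as f_input:
    data = json.load(f_input)

# one two-column frame per key: the timestamp plus that key's values
frames = [
    pd.DataFrame(records).rename(columns={'ts': 'TimeStamp', 'value': key})
    for key, records in data.items()
]

# join all the frames on the shared TimeStamp column
fin = reduce(lambda left, right: left.merge(right, on='TimeStamp'), frames)
fin.to_csv('out.csv', encoding='utf-8', index=False)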
I hope it will be useful to someone in the future. I am not sure it is the simplest way, but it works for me! Greetings