I'm trying to import a JSON file that is a list of dictionaries, one of which contains two epoch time values (a start time and an end time). If every record includes both times, there's no problem: `pd.json_normalize` loads all values correctly, exactly as they appear in the source data. But because some end-time values are missing altogether, pandas needs to fill in a NaN, so the entire column of the df is apparently promoted to a non-integer type, AND, apparently because of that (I don't know), all values that DO exist are changed to exponential notation, rendering the data useless.
Here's a typical JSON file:
[ {
"id": "U1Q6MDpERVNLVE9Q",
"A": {
"ProcessName": "A3463453",
"SubProcess": "M1",
"Machine": "46"
},
"B": {
"user": "a12",
"polhode": "343282"
},
"C": {
"rotorState": "m-mode",
"startTime": 1671540600000,
"endTime": 1672963068453,
"ProcessElapsedMs": 6142877
},
"D": "fb6a9154-44a2-3d60-b978-f1d2ad1a68ff"
}, {
"id": "QVNUOjA6REVTS1RP",
"A": {
"ProcessName": "A3465453",
"SubProcess": "M1",
"Machine": "47"
},
"B": {
"user": "a12",
"polhode": "343282"
},
"C": {
"rotorState": "f-mode",
"startTime": 1671720693000,
"ProcessElapsedMs": 71973000
},
"D": "28e160c9-d954-35d7-a077-70fc70711baf"
}, {
"id": "NUOjA6REVTS1RPUA",
"A": {
"ProcessName": "A3465453",
"SubProcess": "M3",
"Machine": "48"
},
"B": {
"user": "a12",
"polhode": "343282"
},
"C": {
"rotorState": "m-mode",
"startTime": 1673000200000,
"endTime": 1673001028516,
"ProcessElapsedMs": 10160506
},
"D": "ed7077f2-b64c-3944-a0c3-9f0612826c85"
}, {
"id": "U1Q6MDpERVNLVE9Q",
"A": {
"ProcessName": "A3463853",
"SubProcess": "M3",
"Machine": "49"
},
"B": {
"user": "a12",
"polhode": "343282"
},
"C": {
"rotorState": "m-mode",
"startTime": 1673006529000,
"endTime": 1673001028516,
"ProcessElapsedMs": 3832128
},
"D": "0d671793-9679-3e72-9862-f31ad75cfd89"
}, {
"id": "zMzg5OkRFU0tUT1A",
"A": {
"ProcessName": "A3476553",
"SubProcess": "M18",
"Machine": "31"
},
"B": {
"user": "a12",
"polhode": "343282"
},
"C": {
"rotorState": "m-mode",
"startTime": 1671758829000,
"endTime": 1672916208140,
"ProcessElapsedMs": 3832128
},
"D": "1ab25dec-c7d8-3ea8-8dbf-c7c48beaa65a"
}
]
and, using this code
import json
import pandas as pd

with open(full_json_path) as json_file:
    json_data = json.load(json_file)

json_df = pd.json_normalize(json_data, max_level=1)
insert_df = json_df[["A.ProcessName", "A.SubProcess", "C.rotorState", "C.startTime", "C.endTime", "C.ProcessElapsedMs"]]
insert_df.columns = ["ProcessName", "SubProcess", "GState", "StartEpoch", "EndEpoch", "ProcessElapsedMs"]
print(insert_df.to_string())
I get this result:
  A.ProcessName A.SubProcess C.rotorState    C.startTime     C.endTime  C.ProcessElapsedMs
0      A3463453           M1       m-mode  1671540600000  1.672963e+12             6142877
1      A3465453           M1       f-mode  1671720693000           NaN             71973000
2      A3465453           M3       m-mode  1673000200000  1.673001e+12             10160506
3      A3463853           M3       m-mode  1673006529000  1.673001e+12              3832128
4      A3476553          M18       m-mode  1671758829000  1.672916e+12              3832128
On the other hand, if the data looks like this, no missing EndEpoch values:
  ProcessName SubProcess  GState     StartEpoch       EndEpoch  ProcessElapsedMs
0    A3463453         M1  m-mode  1671540600000  1672963068453           6142877
1    A3465453         M1  f-mode  1671720693000  1672963068453          71973000
2    A3465453         M3  m-mode  1673000200000  1673001028516          10160506
3    A3463853         M3  m-mode  1673006529000  1673001028516           3832128
4    A3476553        M18  m-mode  1671758829000  1672916208140           3832128
then the result is what I need to have:
  ProcessName SubProcess  GState     StartEpoch       EndEpoch  ProcessElapsedMs
0    A3463453         M1  m-mode  1671540600000  1672963068453           6142877
1    A3465453         M1  f-mode  1671720693000  1672963068453          71973000
2    A3465453         M3  m-mode  1673000200000  1673001028516          10160506
3    A3463853         M3  m-mode  1673006529000  1673001028516           3832128
4    A3476553        M18  m-mode  1671758829000  1672916208140           3832128
What have I tried?
Well, explicitly creating the dataframe with an integer dtype doesn't work, because pandas complains when a NaN value can't be loaded into the integer column. Replacing the NaN with, say, a 0 doesn't work either, because that has no effect at all on the exponential-notation values already in the dataframe.
I'm totally new to Python, and I don't care which library, function, or method I use--I just want my data to not be goofed up. I have to think this is a problem handled routinely; I just don't know what smart questions to ask.
On an unrelated note, I'd also like to use something similar to record_path, if you will, to tell json_normalize to totally ignore/bypass one or more entire dictionaries--such as "B" in my example data--so I don't have to import it at all. If you can point me in a direction, that would be very helpful.
My apologies for not using correct terminology wherever that may have happened.
The `EndEpoch` column contains NaN values, which are treated as float, so the large float numbers are displayed in scientific notation by default. You can convert the `EndEpoch` column to nullable integers with `df['EndEpoch'] = df['EndEpoch'].astype('Int64')`.
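A minimal, self-contained sketch of the conversion (using a made-up two-row frame rather than your full JSON, to isolate the dtype behavior):

```python
import pandas as pd

# A float64 column holding a missing value prints large numbers
# in scientific notation by default
df = pd.DataFrame({"EndEpoch": [1672963068453.0, float("nan")]})

# Convert to pandas' nullable integer dtype: the NaN becomes <NA>
# and the existing values are shown as plain integers again
df["EndEpoch"] = df["EndEpoch"].astype("Int64")
print(df)
```

Note the capital "I" in `'Int64'`: that selects the nullable extension dtype, while lowercase `'int64'` is the NumPy dtype that cannot hold missing values and raises on NaN.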
# After dtype conversion
print(df)
ProcessName SubProcess GState StartEpoch EndEpoch ProcessElapsedMs
0 A3463453 VXABCSV10P10248 m-mode 1671540600000 1672963068453 6142877
1 A3465453 VXABCSACP202707 f-mode 1671720693000 <NA> 71973000
2 A3465453 VXABHEC10VA0049 m-mode 1673000200000 1673001028516 10160506
3 A3463853 VXABHCOSP101033 m-mode 1673006529000 1673001028516 3832128
4 A3476553 VXABV2CHE600143 m-mode 1671758829000 1672916208140 3832128
As far as I know, there is no way for `pd.json_normalize` to ignore part of the JSON. The `record_path` and `meta` arguments are designed for extracting an inner list of records and extra metadata from a JSON array. Instead, you can use jmespath to filter the JSON first.