
How to Import Integer as Numeric String and not in Scientific Notation

Trying to import a JSON file which is a list of dictionaries, one of which contains two epoch-time values (a start time and an end time). If all instances include both times, there is no problem--pandas' json_normalize loads all values correctly, exactly as they appear in the source data. But because some end-time values are missing altogether, Python needs to insert a NaN, so the entire column of the df is apparently no longer integer, AND, apparently because of that (I don't know), all values that DO exist are changed to exponential notation, rendering the data useless.
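Here is a minimal standalone sketch of the effect (a toy Series, not my real data): a single missing value is enough to turn the column into float64, and pandas then prints large floats in scientific notation.

import pandas as pd

# One missing end time forces the whole column to float64, and pandas
# displays large float64 values in scientific notation by default.
s = pd.Series([1672963068453, None])
print(s.dtype)  # float64
print(s)
# 0    1.672963e+12
# 1             NaN
# dtype: float64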

Here's a typical JSON file:

[ {
    "id": "U1Q6MDpERVNLVE9Q",
    "A": {
        "ProcessName": "A3463453",
        "SubProcess": "M1",
        "Machine": "46"
    },
    "B": {
        "user": "a12",
        "polhode": "343282"
    },
    "C": {
        "rotorState": "m-mode",
        "startTime": 1671540600000,
        "endTime": 1672963068453,
        "ProcessElapsedMs": 6142877
    },
    "D": "fb6a9154-44a2-3d60-b978-f1d2ad1a68ff"
}, {
    "id": "QVNUOjA6REVTS1RP",
    "A": {
        "ProcessName": "A3465453",
        "SubProcess": "M1",
        "Machine": "47"
    },
    "B": {
        "user": "a12",
        "polhode": "343282"
    },
    "C": {
        "rotorState": "f-mode",
        "startTime": 1671720693000,
        "ProcessElapsedMs": 71973000
    },
    "D": "28e160c9-d954-35d7-a077-70fc70711baf"
}, {
    "id": "NUOjA6REVTS1RPUA",
    "A": {
        "ProcessName": "A3465453",
        "SubProcess": "M3",
        "Machine": "48"
    },
    "B": {
        "user": "a12",
        "polhode": "343282"
    },
    "C": {
        "rotorState": "m-mode",
        "startTime": 1673000200000,
        "endTime": 1673001028516,           
        "ProcessElapsedMs": 10160506
    },
    "D": "ed7077f2-b64c-3944-a0c3-9f0612826c85"
}, {
    "id": "U1Q6MDpERVNLVE9Q",
    "A": {
        "ProcessName": "A3463853",
        "SubProcess": "M3",
        "Machine": "49"
    },
    "B": {
        "user": "a12",
        "polhode": "343282"
    },
    "C": {
        "rotorState": "m-mode",
        "startTime": 1673006529000,
        "endTime": 1673001028516,           
        "ProcessElapsedMs": 3832128
    },
    "D": "0d671793-9679-3e72-9862-f31ad75cfd89"
}, {
    "id": "zMzg5OkRFU0tUT1A",
    "A": {
        "ProcessName": "A3476553",
        "SubProcess": "M18",
        "Machine": "31"
    },
    "B": {
        "user": "a12",
        "polhode": "343282"
    },
    "C": {
        "rotorState": "m-mode",
        "startTime": 1671758829000,
        "endTime": 1672916208140,
        "ProcessElapsedMs": 3832128         
    },
    "D": "1ab25dec-c7d8-3ea8-8dbf-c7c48beaa65a"
}

]

and, using this code

import json
import pandas as pd

with open(full_json_path) as json_file:
    json_data = json.load(json_file)

json_df = pd.json_normalize(json_data, max_level=1)
insert_df = json_df[["A.ProcessName", "A.SubProcess", "C.rotorState", "C.startTime", "C.endTime", "C.ProcessElapsedMs"]]
insert_df.columns = ["ProcessName", "SubProcess", "GState", "StartEpoch", "EndEpoch", "ProcessElapsedMs"]
print(insert_df.to_string())

I get this result:

  A.ProcessName A.SubProcess C.rotorState    C.startTime     C.endTime  C.ProcessElapsedMs
0      A3463453           M1       m-mode  1671540600000  1.672963e+12              6142877
1      A3465453           M1       f-mode  1671720693000           NaN             71973000
2      A3465453           M3       m-mode  1673000200000  1.673001e+12             10160506
3      A3463853           M3       m-mode  1673006529000  1.673001e+12              3832128
4      A3476553          M18       m-mode  1671758829000  1.672916e+12              3832128

On the other hand, if the data looks like this, with no missing EndEpoch values:

  ProcessName SubProcess  GState     StartEpoch       EndEpoch  ProcessElapsedMs
0    A3463453         M1  m-mode  1671540600000  1672963068453           6142877
1    A3465453         M1  f-mode  1671720693000  1672963068453          71973000
2    A3465453         M3  m-mode  1673000200000  1673001028516          10160506
3    A3463853         M3  m-mode  1673006529000  1673001028516           3832128
4    A3476553        M18  m-mode  1671758829000  1672916208140           3832128

then the result is what I need to have:

  ProcessName SubProcess  GState     StartEpoch       EndEpoch  ProcessElapsedMs
0    A3463453         M1  m-mode  1671540600000  1672963068453           6142877
1    A3465453         M1  f-mode  1671720693000  1672963068453          71973000
2    A3465453         M3  m-mode  1673000200000  1673001028516          10160506
3    A3463853         M3  m-mode  1673006529000  1673001028516           3832128
4    A3476553        M18  m-mode  1671758829000  1672916208140           3832128

What have I tried?

Well, explicitly creating the dataframe with an integer data type doesn't work, because pandas complains when there's a NaN value that can't be loaded into an integer column. Trying to replace the NaN with, say, a 0 doesn't work either, because that has no impact at all on the exponential-notation values that are already in the data frame.
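For reference, a minimal sketch of those two dead ends (again a toy Series rather than my real frame; the exact exception type depends on the pandas version):

import pandas as pd

s = pd.Series([1.672963068453e12, None])

# Dead end 1: a plain integer cast refuses the NaN.
try:
    s.astype("int64")
except (ValueError, TypeError) as exc:
    print(exc)  # cannot convert non-finite values (NA or inf) to integer

# Dead end 2: filling the NaN keeps the dtype as float64, so the values
# that do exist are still printed in scientific notation.
print(s.fillna(0))
# 0    1.672963e+12
# 1    0.000000e+00
# dtype: float64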

I'm totally new to Python, and I don't care which library, function, or method I use--I just want my data to not be goofed up. I have to think this is a problem that is handled routinely; I just don't know what smart questions to ask.

On an unrelated note, I'd also like to be able to use something similar to record_path, if you will, to tell json_normalize to ignore/bypass one or more entire dictionaries--such as "B" in my example data--so I don't have to import it at all. If you can point me in a direction, that would be very helpful.

My apologies for not using correct terminology wherever that may have happened.

The EndEpoch column contains NaN values, so pandas treats it as float, and large floats are displayed in scientific notation by default. You can convert the EndEpoch column to nullable integers with df['EndEpoch'] = df['EndEpoch'].astype('Int64').
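A minimal sketch of that conversion applied to the frame from the question (assuming df is the insert_df built above; note the capital I in 'Int64', which selects pandas' nullable integer dtype):

df = insert_df  # the renamed frame from the question

# Cast the column that became float (only because of the NaN) back to
# integers; missing end times show up as <NA> instead of NaN.
df["EndEpoch"] = df["EndEpoch"].astype("Int64")

# The same cast works on several columns at once if preferred.
df[["StartEpoch", "EndEpoch"]] = df[["StartEpoch", "EndEpoch"]].astype("Int64")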

# After dtype conversion
print(df)

  ProcessName       SubProcess  GState     StartEpoch       EndEpoch  ProcessElapsedMs
0    A3463453  VXABCSV10P10248  m-mode  1671540600000  1672963068453           6142877
1    A3465453  VXABCSACP202707  f-mode  1671720693000           <NA>          71973000
2    A3465453  VXABHEC10VA0049  m-mode  1673000200000  1673001028516          10160506
3    A3463853  VXABHCOSP101033  m-mode  1673006529000  1673001028516           3832128
4    A3476553  VXABV2CHE600143  m-mode  1671758829000  1672916208140           3832128

As far as I know, there is no way for pd.json_normalize to ignore part of the JSON. The record_path and meta arguments are designed for extracting an inner list of records and extra metadata from a JSON array. Instead, you can use jmespath to filter the JSON first.
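For example, a sketch with the jmespath package (pip install jmespath), reusing the full_json_path from the question; the multiselect expression below is only an illustration that keeps id, A, C and D and simply never selects B:

import json

import jmespath
import pandas as pd

with open(full_json_path) as f:
    data = json.load(f)

# Project each record onto the keys we want; "B" is dropped by omission.
filtered = jmespath.search("[*].{id: id, A: A, C: C, D: D}", data)

df = pd.json_normalize(filtered, max_level=1)
print(df.columns.tolist())  # no B.* columns appear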
