
How to convert a complex dictionary into a Pandas DataFrame in streaming data

All kinds of nested dictionaries and Data Structures:)

I have a sample dictionary -

stream = {
    "Outerclass": {
        "Main_ID": "1",
        "SetID": "1041",
        "Version": 2,
        "nestedData": {
            "time": ["5000", "6000", "7000"],
            "value": [1, 2, 3]
        }
    }
}

and I want to create a dataframe out of it like this -

  Main_ID SetID  Version  Time  Value
0       1  1041      2.0  5000      1
1       1  1041      2.0  6000      2
2       1  1041      2.0  7000      3

I wrote the code below to try to produce this. I know it is not a good approach, so any suggestions would be great. I am also sure it will perform horribly when I run it against streaming data. These three DataFrames will be created in a single loop, and the time and value lists could each hold 30,000 to 100,000 entries.

Code-

import pandas as pd

stream =  {
    "Outerclass": {
        "Main_ID": "1",
        "SetID": "1041",
        "Version": 2,
        "nestedData": {
            "time": ["5000", "6000", "7000"],
            "value": [1, 2, 3]
        }

    } }

df_outer = pd.DataFrame(stream["Outerclass"], index=[0])
print(df_outer)


df_time = pd.DataFrame(stream["Outerclass"]["nestedData"]["time"], columns=["Time"])
print(df_time)

df_value = pd.DataFrame(stream["Outerclass"]["nestedData"]["value"], columns=["Value"])
print(df_value)

full_df = pd.concat([df_outer,df_time,df_value], sort=True, axis=1)

print(full_df)


del full_df["nestedData"]

print(full_df)

Output -

  Main_ID SetID  Version  Time  Value
0       1  1041      2.0  5000      1
1     NaN   NaN      NaN  6000      2
2     NaN   NaN      NaN  7000      3

Use json_normalize to flatten the dict into a DataFrame, then use explode to turn the list elements into rows:

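import pandas as pd

stream = {
    "Outerclass": {
        "Main_ID": "1",
        "SetID": "1041",
        "Version": 2,
        "nestedData": {
            "time": ["5000", "6000", "7000"],
            "value": [1, 2, 3]
        }
    }
}

# Flatten nested keys into dotted column names; the lists stay as single cells.
df = pd.json_normalize(stream)
# Explode every column: list cells become rows, scalar cells are repeated.
df = df.apply(pd.Series.explode).reset_index(drop=True)
print(df)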


  Outerclass.Main_ID Outerclass.SetID  Outerclass.Version Outerclass.nestedData.time Outerclass.nestedData.value
0                  1             1041                   2                       5000                           1
1                  1             1041                   2                       6000                           2
2                  1             1041                   2                       7000                           3
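To get the short column names from the question, one option is to normalize from the inner dict and rename the exploded columns. The following is only a sketch: it assumes pandas 1.3 or later (needed for exploding several columns at once), and the target names Time and Value are taken from the desired output in the question.

```python
import pandas as pd

stream = {
    "Outerclass": {
        "Main_ID": "1",
        "SetID": "1041",
        "Version": 2,
        "nestedData": {
            "time": ["5000", "6000", "7000"],
            "value": [1, 2, 3],
        },
    }
}

# Start from the inner dict so the columns do not carry the "Outerclass." prefix.
df = pd.json_normalize(stream["Outerclass"])

# Explode both list columns together; this requires pandas >= 1.3 and
# lists of equal length in each row.
df = df.explode(["nestedData.time", "nestedData.value"], ignore_index=True)

# Rename to the column names shown in the question's desired output.
df = df.rename(columns={"nestedData.time": "Time", "nestedData.value": "Value"})
print(df)
```

The exploded columns come back with object dtype, so a cast such as df.astype({"Time": int, "Value": int}) may be needed if numeric columns are expected.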

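Alternatively, we can try:

import pandas as pd

# pd.json_normalize replaces pandas.io.json.json_normalize, deprecated since pandas 1.0
s = pd.json_normalize(stream['Outerclass'])
# Pop each list column, explode it, and join the results back on the index
s = s.join(pd.concat([s.pop(x).explode() for x in ['nestedData.time', 'nestedData.value']], axis=1))
s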
Out[222]: 
  Main_ID SetID  Version nestedData.time nestedData.value
0       1  1041        2            5000                1
0       1  1041        2            6000                2
0       1  1041        2            7000                3
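Since the question is about performance on lists of 30,000 to 100,000 entries, it may also be worth skipping json_normalize and explode entirely. The sketch below builds the frame directly from the dict and relies on pandas broadcasting scalar values against list-valued columns. The helper name to_frame and the messages list are hypothetical, and the code assumes every message has the same layout as the sample, with time and value lists of equal length.

```python
import pandas as pd

def to_frame(message):
    """Build one DataFrame per message (hypothetical helper, not from the thread)."""
    outer = message["Outerclass"]
    nested = outer["nestedData"]
    # Scalars are broadcast to the length of the list-valued columns.
    return pd.DataFrame({
        "Main_ID": outer["Main_ID"],
        "SetID": outer["SetID"],
        "Version": outer["Version"],
        "Time": nested["time"],
        "Value": nested["value"],
    })

stream = {
    "Outerclass": {
        "Main_ID": "1",
        "SetID": "1041",
        "Version": 2,
        "nestedData": {
            "time": ["5000", "6000", "7000"],
            "value": [1, 2, 3],
        },
    }
}

df = to_frame(stream)
print(df)

# For a stream, collect the per-message frames and concatenate once at the end
# instead of growing a DataFrame inside the loop.
messages = [stream, stream]  # stand-in for the incoming messages (assumption)
combined = pd.concat([to_frame(m) for m in messages], ignore_index=True)
```

Concatenating once at the end avoids repeatedly copying a growing frame. Whether this is actually faster than the explode-based answers on real data would need to be measured.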
