
Most Efficient Way to Convert a Complex List of Dictionaries Into a Pandas Dataframe

I have a list of dictionaries in event_records; a subset of the list is below. Each dictionary contains 2 or 3 key-value pairs. The first key is item, and its corresponding value is event#status.

The second key is count, and its value is a dictionary containing 8 scalar key-value pairs plus a customEvents key whose value is a list of 9 dictionaries, each containing 3 key-value pairs.

The third key (present only some of the time) is errors, and its value is a list containing a dictionary with 3 key-value pairs.

What is the most efficient way to convert the below list of dictionaries in event_records into a pandas DataFrame? I tried the following code, but it is very slow.

from pandas.io.json import json_normalize  # deprecated since pandas 1.0; use pd.json_normalize instead
import pandas as pd

df1 = json_normalize(event_records)
df2 = df1['customEvents']
custom_events_list = []
for element in df2: 
    df3 = json_normalize(element)
    df4 = df3[['type', 'value']]
    df5 = df4.T
    df5.columns = df5.iloc[0]
    df5 = df5[1:]
    custom_events_list.append(df5)
df6 = pd.concat(custom_events_list)
df6 = df6.reset_index(drop = True)
df7 = df1.join(df6)

df8 = df1['errors']
event_error_list = []
for element in df8: 
    df9 = json_normalize(element)
    df10 = df9[['response', 'feedback']]
    event_error_list.append(df10)
df11 = pd.concat(event_error_list)
df11 = df11.reset_index(drop = True)
df12 = df7.join(df11)
df13 = df12[['old_id', 'new_id', 'event_id', 'event_time', 'value', 'quantity', 'unique_id', 'A3', 'A4', 'A6', 'A9', 'A10', 'A11', 'A12', 'A13', 'A14', 'response', 'feedback']]

event_records = [{'item': 'event#status',
  'count': {'item': 'event#count',
   'old_id': '123',
   'new_id': '456',
   'event_id': '111',
   'event_time': '1200',
   'value': 1.0,
   'quantity': '1',
   'unique_id': '222',
   'customEvents': [{'item': 'event#custom', 'type': 'A3', 'value': ''},
    {'item': 'event#custom', 'type': 'A4', 'value': '11AA'},
    {'item': 'event#custom', 'type': 'A6', 'value': 'AAB1'},
    {'item': 'event#custom', 'type': 'A9', 'value': ''},
    {'item': 'event#custom', 'type': 'A10', 'value': '10.5'},
    {'item': 'event#custom', 'type': 'A11', 'value': 'ABC'},
    {'item': 'event#custom', 'type': 'A12', 'value': 'NYR'},
    {'item': 'event#custom', 'type': 'A13', 'value': 'NYR'},
    {'item': 'event#custom', 'type': 'A14', 'value': 'NYR'}]},
  'errors': [{'item': 'event#Error',
    'response': 'NONE',
    'feedback': 'Event not found'}]},
 {'item': 'event#status',
  'count': {'item': 'event#count',
   'old_id': '567',
   'new_id': '789',
   'event_id': '333',
   'event_time': '1400',
   'value': 1.0,
   'quantity': '1',
   'unique_id': '444',
   'customEvents': [{'item': 'event#custom', 'type': 'A3', 'value': ''},
    {'item': 'event#custom', 'type': 'A4', 'value': '22BB'},
    {'item': 'event#custom', 'type': 'A6', 'value': 'CCD1'},
    {'item': 'event#custom', 'type': 'A9', 'value': ''},
    {'item': 'event#custom', 'type': 'A10', 'value': '20.5'},
    {'item': 'event#custom', 'type': 'A11', 'value': 'ABC'},
    {'item': 'event#custom', 'type': 'A12', 'value': 'NYR'},
    {'item': 'event#custom', 'type': 'A13', 'value': 'NYR'},
    {'item': 'event#custom', 'type': 'A14', 'value': 'NYR'}]}}]

The desired Pandas dataframe output is as follows:

old_id    new_id    event_id    event_time    value    quantity    unique_id    A3    A4    A6    A9    A10    A11    A12    A13    A14    response    feedback
123       456       111         1200          1.0      1           222                11AA  AAB1        10.5   ABC    NYR    NYR    NYR    NONE        Event not found
567       789       333         1400          1.0      1           444                22BB  CCD1        20.5   ABC    NYR    NYR    NYR

Appending to DataFrames is slow because each addition recreates the entire object, and your code creates 13 DataFrames. I recommend you do all of the formatting outside of the DataFrame object, and then create the DataFrame in one fell swoop. There are multiple ways to create a DataFrame (the GeeksforGeeks site has a few examples), and you can choose whichever is easiest for you.

The way that seems the fastest to me would be to iterate through the event records list as follows:

processed_records = []
for event_record in event_records:
    processed_records.append(process_record(event_record))

df = pd.DataFrame(processed_records)

Then you need to write a function called process_record that extracts all the relevant data out of an event record and returns it as a flat dictionary (e.g. {"old_id": 123, "new_id": 345, "event_id": 567, ..., "feedback": None}). There are a few quirks to pay attention to: because some records don't have errors, make sure you still insert None (or -1, or some other sentinel) for the error fields so that every row has the same keys; otherwise those cells will show up as NaN in pandas. This takes a bit of tedious code, but it will be much faster than the version that creates 12 unnecessary DataFrames.
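
A minimal sketch of such a function, assuming the record layout shown in the question (the flattening choices, like spreading each custom event's type/value pair into its own column and taking the first error when present, are just illustrative):

def process_record(record):
    """Flatten one event record into a single dict (one output row)."""
    count = record['count']
    # copy the scalar fields, skipping the 'item' marker and the nested list
    row = {k: v for k, v in count.items()
           if k not in ('item', 'customEvents')}
    # spread each custom event's type/value pair into its own column
    for ce in count.get('customEvents', []):
        row[ce['type']] = ce['value']
    # errors may be absent; emit explicit nulls so every row has the same keys
    error = record['errors'][0] if record.get('errors') else {}
    row['response'] = error.get('response')
    row['feedback'] = error.get('feedback')
    return row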


The data processing here is quite elegant thanks to pandas' json_normalize and list comprehensions.

First, extract the custom events:

from pandas import json_normalize  # pd.json_normalize in pandas >= 1.0

parent_fields = ['old_id', 'new_id', 'event_id', 'event_time', 'value', 'quantity', 'unique_id']
custom_events = json_normalize(
    [r['count'] for r in event_records],
    'customEvents',
    parent_fields,
    record_prefix='#'
)

Then extract the errors. Here I use a lesser-known feature of list comprehensions that allows both filtering and iteration over nested elements, to generate the records that are fed into the DataFrame constructor:

errors = pd.DataFrame(
    [(e['response'], e['feedback'], r['count']['unique_id'])
     for r in event_records if 'errors' in r
     for e in r['errors']],
    columns=['response', 'feedback', 'unique_id'])

Merge the two DataFrames, then pivot the custom events into columns:

df = custom_events.merge(
    errors,
    on='unique_id',
    how='left'
)

# drop the constant '#item' column so it doesn't end up in the index,
# then pivot the '#type' values into columns
shaped = df.drop(columns='#item').set_index(
    [c for c in df.columns if c not in ('#item', '#value')]
).unstack('#type')

At this point, shaped is a DataFrame with the desired shape (record_prefix='#' keeps the custom events' value column distinct from the top-level value column); however, the columns are still a MultiIndex instead of a flat list.

#shaped outputs:
                                                                                    #value
#type                                                                                  A10  A11  A12  A13  A14 A3    A4    A6 A9
old_id new_id event_id event_time value quantity unique_id response feedback
123    456    111      1200       1.0   1        222       NONE     Event not found   10.5  ABC  NYR  NYR  NYR     11AA  AAB1
567    789    333      1400       1.0   1        444       NaN      NaN               20.5  ABC  NYR  NYR  NYR     22BB  CCD1

Set the columns to the 2nd level of the MultiIndex and reset the DataFrame's index; if you want to, you can then re-order the columns.

shaped.columns = shaped.columns.levels[1]
shaped.reset_index()
# outputs:
#type old_id new_id event_id event_time  value quantity unique_id response         feedback   A10  A11  A12  A13  A14 A3    A4    A6 A9
0        123    456      111       1200    1.0        1       222     NONE  Event not found  10.5  ABC  NYR  NYR  NYR     11AA  AAB1
1        567    789      333       1400    1.0        1       444      NaN              NaN  20.5  ABC  NYR  NYR  NYR     22BB  CCD1
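
To persist the reset and match the column order of the desired output, you can select the columns explicitly; a small sketch, with the column list simply copied from the question:

final = shaped.reset_index()
final = final[['old_id', 'new_id', 'event_id', 'event_time', 'value',
               'quantity', 'unique_id', 'A3', 'A4', 'A6', 'A9', 'A10',
               'A11', 'A12', 'A13', 'A14', 'response', 'feedback']]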

I'll suggest we create three DataFrames and concatenate them afterwards. Some of the data is nested in lists, nested in dicts, nested in lists again; some journey. Personally, I use a library (jmespath) to make the journey easier and, IMHO, simpler. As I said, personally. Here goes:

import jmespath
import pandas as pd
from collections import defaultdict

First we create the DataFrame for the ids; the nesting isn't so deep here, so a comprehension (well, a nested comprehension) should do the trick, and it is the more sensible approach here:

df1 = pd.DataFrame({key: value
                    for key, value in entry['count'].items()
                    if key not in ('customEvents', 'item')}
                   for entry in event_records)

df1

  old_id    new_id  event_id    event_time  value   quantity    unique_id
0   123     456     111           1200       1.0       1         222
1   567     789     333           1400       1.0       1         444

The second DataFrame to pull out is the 'A's; this is where jmespath comes into play, as it allows easy traversal of nested lists/dicts. You could write a nested list comprehension here, but jmespath lets us avoid that nestedness:

The path to customEvents is: list -> dict -> count -> customEvents -> list. Keys are accessed in jmespath via the dot (.) symbol, while lists are accessed via the bracket ([]) symbol.

As = jmespath.compile('[].count.customEvents[]')
out = As.search(event_records)

print(out)

[{'item': 'event#custom', 'type': 'A3', 'value': ''},
 {'item': 'event#custom', 'type': 'A4', 'value': '11AA'},
 {'item': 'event#custom', 'type': 'A6', 'value': 'AAB1'},
 {'item': 'event#custom', 'type': 'A9', 'value': ''},
 {'item': 'event#custom', 'type': 'A10', 'value': '10.5'},
 {'item': 'event#custom', 'type': 'A11', 'value': 'ABC'},
 {'item': 'event#custom', 'type': 'A12', 'value': 'NYR'},
 {'item': 'event#custom', 'type': 'A13', 'value': 'NYR'},
 {'item': 'event#custom', 'type': 'A14', 'value': 'NYR'},
 {'item': 'event#custom', 'type': 'A3', 'value': ''},
 {'item': 'event#custom', 'type': 'A4', 'value': '22BB'},
 {'item': 'event#custom', 'type': 'A6', 'value': 'CCD1'},
 {'item': 'event#custom', 'type': 'A9', 'value': ''},
 {'item': 'event#custom', 'type': 'A10', 'value': '20.5'},
 {'item': 'event#custom', 'type': 'A11', 'value': 'ABC'},
 {'item': 'event#custom', 'type': 'A12', 'value': 'NYR'},
 {'item': 'event#custom', 'type': 'A13', 'value': 'NYR'},
 {'item': 'event#custom', 'type': 'A14', 'value': 'NYR'}]

Next, we use a defaultdict to group the value entries under their type keys:

d = defaultdict(list)

for i in out:
    d[i['type']].append(i['value'])

print(d)

defaultdict(list,
            {'A3': ['', ''],
             'A4': ['11AA', '22BB'],
             'A6': ['AAB1', 'CCD1'],
             'A9': ['', ''],
             'A10': ['10.5', '20.5'],
             'A11': ['ABC', 'ABC'],
             'A12': ['NYR', 'NYR'],
             'A13': ['NYR', 'NYR'],
             'A14': ['NYR', 'NYR']})

Read it into a DataFrame:

df2 = pd.DataFrame(d)
df2

  A3     A4      A6     A9  A10     A11 A12 A13 A14
0       11AA    AAB1        10.5    ABC NYR NYR NYR
1       22BB    CCD1        20.5    ABC NYR NYR NYR

The third part is to extract the errors data: the same concept of [] for lists and . for keys applies here as well; however, here we can get the data back as key:value pairs, like a dict:

errors = jmespath.compile('[].errors[].{response:response,feedback:feedback}')
err = errors.search(event_records)

print(err)

[{'response': 'NONE', 'feedback': 'Event not found'}]

Read it into a DataFrame:

df3 = pd.DataFrame(err)
df3

    response    feedback
0   NONE    Event not found

And we are at the end: concatenate the DataFrames along the columns:

result = pd.concat([df1, df2, df3], axis=1)



   old_id  new_id   event_id    event_time  value   quantity    unique_id   A3  A4  A6  A9  A10 A11 A12 A13 A14 response    feedback
0   123     456      111         1200       1.0         1       222            11AA AAB1        10.5    ABC NYR NYR NYR NONE    Event not found
1   567     789     333         1400        1.0         1       444           22BB  CCD1        20.5    ABC NYR NYR NYR NaN NaN
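
One caveat on the final step: pd.concat aligns on the index, and df3 only has rows for records that contain errors, so the result above is correct only because the record with the error happens to come first. A more defensive variant (a plain-Python sketch rather than jmespath) keys each error row by its record position before concatenating:

err_rows = {}
for i, record in enumerate(event_records):
    # remember the record's position so the error lands on the matching row
    for e in record.get('errors', []):
        err_rows[i] = {'response': e['response'], 'feedback': e['feedback']}

df3 = pd.DataFrame.from_dict(err_rows, orient='index')
result = pd.concat([df1, df2, df3], axis=1)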
