Data structure manipulation with Pandas

I have a list of dicts as follows:

[
{
    "status": "BV", 
    "max_total_duration": null, 
    "min_total_duration": null, 
    "75th_percentile": 420, 
    "median": 240.0, 
    "25th_percentile": 180, 
    "avg_total_duration": null
}, 
{
    "status": "CORR", 
    "max_total_duration": null, 
    "min_total_duration": null, 
    "75th_percentile": 1380, 
    "median": 720.0, 
    "25th_percentile": 420, 
    "avg_total_duration": null
}, 
{
    "status": "FILL", 
    "max_total_duration": null, 
    "min_total_duration": null, 
    "75th_percentile": 1500, 
    "median": 840.0, 
    "25th_percentile": 480, 
    "avg_total_duration": null
}, 
{
    "status": "INIT", 
    "max_total_duration": 11280, 
    "min_total_duration": 120, 
    "75th_percentile": 720, 
    "median": 360.0, 
    "25th_percentile": 180, 
    "avg_total_duration": 2061
}, 
]

As is evident, max_total_duration, min_total_duration and avg_total_duration are null for every status except "INIT". I want to remove all the null entries and, for INIT, where those three fields have real values, move them into a new dictionary added to the list, as follows:

[
{
    "status": "BV", 
    "75th_percentile": 420, 
    "median": 240.0, 
    "25th_percentile": 180, 
}, 
{
    "status": "CORR", 
    "75th_percentile": 1380, 
    "median": 720.0, 
    "25th_percentile": 420, 
}, 
{
    "status": "FILL", 
    "75th_percentile": 1500, 
    "median": 840.0, 
    "25th_percentile": 480, 
}, 
{
    "status": "INIT", 
    "75th_percentile": 720, 
    "median": 360.0, 
    "25th_percentile": 180, 

}, 
{
    "max_total_duration": 11280, 
    "min_total_duration": 120,
    "avg_total_duration": 2061,
}
]    

I have tried doing this by iterating over the list, and it is computationally very expensive. Is there an easier way of doing this with pandas?

data =[
{
    "status": "BV", 
    "max_total_duration": None, 
    "min_total_duration": None, 
    "75th_percentile": 420, 
    "median": 240.0, 
    "25th_percentile": 180, 
    "avg_total_duration": None
}, 
{
    "status": "CORR", 
    "max_total_duration": None, 
    "min_total_duration": None, 
    "75th_percentile": 1380, 
    "median": 720.0, 
    "25th_percentile": 420, 
    "avg_total_duration": None
}, 
{
    "status": "FILL", 
    "max_total_duration": None, 
    "min_total_duration": None, 
    "75th_percentile": 1500, 
    "median": 840.0, 
    "25th_percentile": 480, 
    "avg_total_duration": None
}, 
{
    "status": "INIT", 
    "max_total_duration": 11280, 
    "min_total_duration": 120, 
    "75th_percentile": 720, 
    "median": 360.0, 
    "25th_percentile": 180, 
    "avg_total_duration": 2061
}, 
]

# Drop the null fields; compare against None so that legitimate zeros are kept
data = [{key: val for key, val in d.items() if val is not None} for d in data]

duration_keys = ('max_total_duration', 'min_total_duration', 'avg_total_duration')

final = []
durations = []
for d in data:
    if d.get('status') == 'INIT':
        # Move the duration fields out of the INIT record into their own dict
        durations.append({key: d.pop(key) for key in duration_keys if key in d})
    final.append(d)
# The duration dict goes last, matching the requested output
final.extend(durations)
print(final)
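The same idea can be wrapped in a function that makes a single pass and leaves the input list unmodified. This is only a sketch: the name split_durations and the DURATION_KEYS constant are made up for illustration, and it assumes that any record with a non-null duration field should contribute a separate dict, not only the one whose status is "INIT".

```python
DURATION_KEYS = ("max_total_duration", "min_total_duration", "avg_total_duration")


def split_durations(records, duration_keys=DURATION_KEYS):
    """Return the cleaned records, followed by one dict per record
    that carried at least one non-null duration value."""
    cleaned, extras = [], []
    for rec in records:
        # Keep every non-null field that is not a duration field.
        cleaned.append({k: v for k, v in rec.items()
                        if k not in duration_keys and v is not None})
        # Collect the non-null duration fields of this record, if any.
        durations = {k: rec[k] for k in duration_keys if rec.get(k) is not None}
        if durations:
            extras.append(durations)
    return cleaned + extras


sample = [
    {"status": "BV", "max_total_duration": None, "median": 240.0},
    {"status": "INIT", "max_total_duration": 11280, "median": 360.0},
]
result = split_durations(sample)
```

Each record is visited once, so the cost grows linearly with the length of the list.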
import pandas as pd

# `data` is the list from the question, with JSON null replaced by Python None
df = pd.DataFrame(data)

>>> df
   25th_percentile  75th_percentile  avg_total_duration  max_total_duration  \
0              180              420                 NaN                 NaN
1              420             1380                 NaN                 NaN
2              480             1500                 NaN                 NaN
3              180              720                2061               11280

   median  min_total_duration status
0     240                 NaN     BV
1     720                 NaN   CORR
2     840                 NaN   FILL
3     360                 120   INIT

Grabbing the percentiles part:

df_percentiles = df[['status','25th_percentile','median','75th_percentile']]

>>> df_percentiles
  status  25th_percentile  median  75th_percentile
0     BV              180     240              420
1   CORR              420     720             1380
2   FILL              480     840             1500
3   INIT              180     360              720

Grabbing the durations part:

df_durations = df.loc[df['status'] == 'INIT', ['max_total_duration','min_total_duration','avg_total_duration']]

>>> df_durations
   max_total_duration  min_total_duration  avg_total_duration
3               11280                 120                2061
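If you would rather not hard-code the "INIT" status, the duration rows can also be picked out by dropping the rows where all three duration columns are missing. A minimal sketch on a two-row sample; the names sample_df and duration_cols are illustrative only:

```python
import pandas as pd

# Small stand-in for the full data: one row without durations, one with.
sample_df = pd.DataFrame([
    {"status": "BV", "max_total_duration": None,
     "min_total_duration": None, "avg_total_duration": None},
    {"status": "INIT", "max_total_duration": 11280,
     "min_total_duration": 120, "avg_total_duration": 2061},
])

duration_cols = ["max_total_duration", "min_total_duration", "avg_total_duration"]

# Keep only the rows that have at least one non-missing duration value.
durations_any_status = sample_df[duration_cols].dropna(how="all")
```

With how="all" a row is dropped only when every selected column is missing, so a partially filled row would still be kept.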

Combine into a list:

# to_dict('records') returns one dict per row
summary = df_percentiles.to_dict('records')

# extend (not append) keeps the result a flat list of dicts
summary.extend(df_durations.to_dict('records'))

>>> summary
[{'25th_percentile': 180,
  '75th_percentile': 420,
  'median': 240.0,
  'status': 'BV'},
 {'25th_percentile': 420,
  '75th_percentile': 1380,
  'median': 720.0,
  'status': 'CORR'},
 {'25th_percentile': 480,
  '75th_percentile': 1500,
  'median': 840.0,
  'status': 'FILL'},
 {'25th_percentile': 180,
  '75th_percentile': 720,
  'median': 360.0,
  'status': 'INIT'},
 {'avg_total_duration': 2061.0,
  'max_total_duration': 11280.0,
  'min_total_duration': 120.0}]
