
Pandas DataFrame w/nested dictionary

After reviewing similar questions on SO, I have been unable to find a way to build a DataFrame from a nested dictionary in the layout I want.

Being new to Pandas and moderately new to Python, I have spent the better part of two days trying and failing at various potential solutions (json_normalize, dictionary flattening, pd.concat, etc.).

I have a method which creates a DataFrame from an API call:

def make_dataframes(self):
    # removed non-related code    
    self._data_frame_counts = pd.DataFrame({
            'Created': (self._data_frame_30days.count()['Created']),
            'Closed': (self._data_frame_30days.count()['Closed']),
            'Owner':
            (self._data_frame_30days['Owner'].value_counts().to_dict()),
            'Resolution':
            (self._data_frame_30days['Resolution'].value_counts().to_dict()),
            'Severity':
            (self._data_frame_30days['Severity'].value_counts().to_dict())
        })

The DataFrame is built from a nested dictionary produced by pandas value_counts():

{'Created': 35,
 'Closed': 6,
 'Owner': {'aName': 30, 'first.last': 3, 'last.first': 2},
 'Resolution': {'TruePositive': 5, 'FalsePositive': 1},
 'Severity': {2: 31, 3: 4}}

After execution, the DataFrame looks like this:

                  Created Closed  Owner  Resolution  Severity
aName             35       6     30.0         NaN       NaN
first.last        35       6      3.0         NaN       NaN
last.first        35       6      2.0         NaN       NaN
TruePositive      35       6      NaN         5.0       NaN
FalsePositive     35       6      NaN         1.0       NaN
2                 35       6      NaN         NaN      31.0
3                 35       6      NaN         NaN       4.0

I want it to look like the following, where the data is aligned correctly with each axis and rows are included for data points that are absent from the current dictionary but could appear in future runs.

                Created Closed  Owner   Resolution  Severity
total           35      6       NaN     NaN         NaN
aName           NaN     NaN     30      NaN         NaN
first.last      NaN     NaN     3       NaN         NaN
last.first      NaN     NaN     2       NaN         NaN
anotherName     NaN     NaN     NaN     NaN         NaN
1               NaN     NaN     NaN     NaN         0
2               NaN     NaN     NaN     NaN         31
3               NaN     NaN     NaN     NaN         4
second.Name     NaN     NaN     NaN     NaN         NaN
third.name      NaN     NaN     NaN     NaN         NaN
TruePositive    NaN     NaN     NaN     5           NaN
FalsePositive   NaN     NaN     NaN     1           NaN

Assuming I have a dictionary d

d = {
    'Created': 35,
    'Closed': 6,
    'Owner': {'aName': 30, 'first.last': 3, 'last.first': 2},
    'Resolution': {'TruePositive': 5, 'FalsePositive': 1},
    'Severity': {2: 31, 3: 4}
}

I'd build replacements for some of the keys:

_d = {
    'Created': {'total': d['Created']},
    'Closed': {'total': d['Closed']},
    'Severity': {k: d['Severity'].get(k, 0) for k in range(1, 4)}
}

pd.DataFrame({**d, **_d})

               Created  Closed  Owner  Resolution  Severity
total             35.0     6.0    NaN         NaN       NaN
aName              NaN     NaN   30.0         NaN       NaN
first.last         NaN     NaN    3.0         NaN       NaN
last.first         NaN     NaN    2.0         NaN       NaN
TruePositive       NaN     NaN    NaN         5.0       NaN
FalsePositive      NaN     NaN    NaN         1.0       NaN
1                  NaN     NaN    NaN         NaN       0.0
2                  NaN     NaN    NaN         NaN      31.0
3                  NaN     NaN    NaN         NaN       4.0

This overrides a few of your keys. Printing _d shows what changed:

print(_d)

{'Created': {'total': 35}, 'Closed': {'total': 6}, 'Severity': {1: 0, 2: 31, 3: 4}}

By default, the pandas.DataFrame constructor can take a dictionary and use the keys as column names. What it does with the values depends on the values.

  • If the value is a scalar, it broadcasts that scalar to all index values. (This is what you saw with the repeated 35 for all rows in the 'Created' column.)
  • If the value is array-like, its length must match the number of rows, because the array is placed into the column element by element.
  • If the value is a dictionary, each key/value pair is mapped into the column, with the keys used as index values.
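
A minimal sketch of the three cases above, using small made-up values in the spirit of the question's data. The variable names are illustrative only:

```python
import pandas as pd

# Scalar next to a dict: the scalar is broadcast to every index label
# that the dict contributes.
broadcast = pd.DataFrame({
    "Created": 35,
    "Owner": {"aName": 30, "first.last": 3},
})

# Array-like values: each list fills its column element by element,
# so the lengths have to agree with the number of rows.
listed = pd.DataFrame(
    {"Created": [35, 35], "Owner": [30, 3]},
    index=["aName", "first.last"],
)

# Dict values only: keys become index labels, and any label missing
# from a given column is filled with NaN.
aligned = pd.DataFrame({
    "Created": {"total": 35},
    "Owner": {"aName": 30, "first.last": 3},
})

# Lists of different lengths are rejected with a ValueError.
try:
    pd.DataFrame({"Created": [35, 35], "Owner": [30, 3, 2]})
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True

print(broadcast)
print(listed)
print(aligned)
```

Only the last form keeps the total on its own row instead of repeating it next to every owner.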

The last item is what motivated my answer. I changed the scalar value of 35 to a dictionary in which I specified the index value: {'total': 35}.


I'd recommend changing the original method to something like this:

def make_dataframes(self):
    # removed non-related code    
    counts = self._data_frame_30days['Severity'].value_counts().to_dict()
    self._data_frame_counts = pd.DataFrame({
            'Created': {'total': self._data_frame_30days.count()['Created']},
            'Closed': {'total': self._data_frame_30days.count()['Closed']},
            'Owner':
            (self._data_frame_30days['Owner'].value_counts().to_dict()),
            'Resolution':
            (self._data_frame_30days['Resolution'].value_counts().to_dict()),
            # Assumes severity levels 1-3 should always appear, as in the desired output.
            'Severity': {k: counts.get(k, 0) for k in sorted({1, 2, 3, *counts})}
        })
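
As an alternative sketch for labels that may only show up in later runs, the result of value_counts() can be reindexed against a list of expected labels. The tickets frame, EXPECTED_SEVERITIES, and count_with_expected below are hypothetical stand-ins for illustration, not part of the original code:

```python
import pandas as pd

# Hypothetical stand-in for self._data_frame_30days.
tickets = pd.DataFrame({
    "Created": ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"],
    "Closed": ["2024-01-05", None, None, "2024-01-06"],
    "Owner": ["aName", "aName", "first.last", "aName"],
    "Resolution": ["TruePositive", None, None, "FalsePositive"],
    "Severity": [2, 2, 3, 2],
})

# Assumption: these are the labels that should always get a row.
EXPECTED_SEVERITIES = [1, 2, 3]


def count_with_expected(series, expected):
    """Count values, adding a zero for expected labels that are absent."""
    counts = series.value_counts()
    # Keep the expected labels first, then any unexpected ones that occurred.
    extra = [label for label in counts.index if label not in expected]
    return counts.reindex(list(expected) + extra, fill_value=0).to_dict()


summary = pd.DataFrame({
    "Created": {"total": tickets["Created"].count()},
    "Closed": {"total": tickets["Closed"].count()},
    "Owner": tickets["Owner"].value_counts().to_dict(),
    "Resolution": tickets["Resolution"].value_counts().to_dict(),
    "Severity": count_with_expected(tickets["Severity"], EXPECTED_SEVERITIES),
})

print(summary)
```

The same helper could be applied to 'Owner' or 'Resolution' if a list of expected names is available; without such a list, there is no way for the DataFrame to know which labels are missing.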
