
Aggregate by value in JSON object within Pandas Dataframe in Python

I have loaded a JSON array into Python as a dataframe using pandas. My Python code is below:

import json
import pandas as pd

jsontxt = pd.read_json('array.json')

df = pd.DataFrame(jsontxt['Total-Hours'])

print(df)

The output is as below:

    Total-Hours

0   {'value': 3.0}
1   {'value': 2.0}
2   {'value': 1.0}
3   {'value': 5.0}
4   {'value': 3.0}
5   {'value': 5.0}

I want to group the data by the value in total hours. Something like below:

val = df.groupby(['Total-Hours']).mean();

My JSON is as below:

[
              {
                "key" : "Jacob",
                "doc_count" : 11,
                "Total-Hours" : {
                  "value" : 3.0
                },
                "Calculated-Category" : {
                  "value" : 4.0
                }
              },
              {
                "key" : "AH",
                "doc_count" : 2,
                "Total-Hours" : {
                  "value" : 2.0
                },
                "Calculated-Category" : {
                  "value" : 1.0
                }
              },
              {
                "key" : "FJ",
                "doc_count" : 1,
                "Total-Hours" : {
                  "value" : 1.0
                },
                "Calculated-Category" : {
                  "value" : 4.0
                }
              },
              {
                "key" : "Helen",
                "doc_count" : 1,
                "Total-Hours" : {
                  "value" : 5.0
                },
                "Calculated-Category" : {
                  "value" : 2.0
                }
              },
              {
                "key" : "Test",
                "doc_count" : 1,
                "Total-Hours" : {
                  "value" : 3.0
                },
                "Calculated-Category" : {
                  "value" : 3.0
                }
              },
              {
                "key" : "John",
                "doc_count" : 1,
                "Total-Hours" : {
                  "value" : 5.0
                },
                "Calculated-Category" : {
                  "value" : 3.0
                }
              }
            ]

However, that requires Total-Hours to be numeric. What is the best way to achieve this?

pandas currently holds each row value as a dict, so you need to replace the column with the number stored under each dict's 'value' key.

Below I use a list comprehension to overwrite the column with the values extracted from the dictionaries. I print the updated dataframe, and then finally print the mean.

Also note that you don't need to create a new dataframe, since read_json already returns one (jsontxt).

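import pandas as pd

# read_json already returns a dataframe
jsontxt = pd.read_json('array.json')

print(jsontxt)

# replace each {'value': x} dict with the number x
jsontxt['Total Hours'] = [x['value'] for x in jsontxt['Total Hours']]

print(jsontxt)

# the column is now float64, so mean() works
print(jsontxt.mean())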

Here is the output:

      Total Hours
0  {'value': 3.0}
1  {'value': 2.0}
2  {'value': 1.0}
3  {'value': 5.0}
4  {'value': 3.0}
5  {'value': 5.0}
   Total Hours
0          3.0
1          2.0
2          1.0
3          5.0
4          3.0
5          5.0
Total Hours    3.166667
dtype: float64

Here is what my input file looked like:

{
    "Total Hours": [
        {"value": 3.0},
        {"value": 2.0},
        {"value": 1.0},
        {"value": 5.0},
        {"value": 3.0},
        {"value": 5.0}
    ]
}
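
As a minimal sketch of how the same extraction carries over to the question's own data, the snippet below inlines the records from the question (instead of reading array.json), converts the hyphenated Total-Hours column to floats, and then runs the groupby the asker wanted. Averaging doc_count per group is an assumption made here for illustration; the question does not say which column should be averaged.

```python
import pandas as pd

# Records copied from the question's JSON, inlined so the sketch is self-contained.
records = [
    {"key": "Jacob", "doc_count": 11, "Total-Hours": {"value": 3.0}, "Calculated-Category": {"value": 4.0}},
    {"key": "AH", "doc_count": 2, "Total-Hours": {"value": 2.0}, "Calculated-Category": {"value": 1.0}},
    {"key": "FJ", "doc_count": 1, "Total-Hours": {"value": 1.0}, "Calculated-Category": {"value": 4.0}},
    {"key": "Helen", "doc_count": 1, "Total-Hours": {"value": 5.0}, "Calculated-Category": {"value": 2.0}},
    {"key": "Test", "doc_count": 1, "Total-Hours": {"value": 3.0}, "Calculated-Category": {"value": 3.0}},
    {"key": "John", "doc_count": 1, "Total-Hours": {"value": 5.0}, "Calculated-Category": {"value": 3.0}},
]

df = pd.DataFrame(records)

# Replace each {'value': x} dict with the number x, as in the answer above.
df["Total-Hours"] = [x["value"] for x in df["Total-Hours"]]

# Total-Hours is now numeric, so it can be used as a grouping key.
# Averaging doc_count is an illustrative choice, not something the question specifies.
grouped = df.groupby("Total-Hours")["doc_count"].mean()

print(grouped)
```

The same list comprehension can be applied to Calculated-Category if that column also needs to be numeric.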

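You can treat your input as a dict and then select the Total Hours column. Calling apply(pd.Series) expands each dict into a row of a new dataframe with a single value column, from which you can compute the mean. Here myjson is assumed to be the already-parsed JSON (a Python dict or list), for example the result of json.load. Note that the result is a Series indexed by 'value', not a bare number.

 mean_hours = pd.DataFrame.from_dict(myjson)['Total Hours'].apply(pd.Series).mean()

or, for the full input from the question (note the hyphen in the column name):

 mean_hours = pd.DataFrame.from_dict(myjson)['Total-Hours'].apply(pd.Series).mean()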
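
Another option, not used in either answer above, is pd.json_normalize, which flattens nested dicts into dotted column names in one step. The sketch below is only an illustration under that assumption: the records are copied from the question, and the short column names total_hours and category are hypothetical names chosen here for readability.

```python
import pandas as pd

# Records copied from the question's JSON, inlined so the sketch is self-contained.
records = [
    {"key": "Jacob", "doc_count": 11, "Total-Hours": {"value": 3.0}, "Calculated-Category": {"value": 4.0}},
    {"key": "AH", "doc_count": 2, "Total-Hours": {"value": 2.0}, "Calculated-Category": {"value": 1.0}},
    {"key": "FJ", "doc_count": 1, "Total-Hours": {"value": 1.0}, "Calculated-Category": {"value": 4.0}},
    {"key": "Helen", "doc_count": 1, "Total-Hours": {"value": 5.0}, "Calculated-Category": {"value": 2.0}},
    {"key": "Test", "doc_count": 1, "Total-Hours": {"value": 3.0}, "Calculated-Category": {"value": 3.0}},
    {"key": "John", "doc_count": 1, "Total-Hours": {"value": 5.0}, "Calculated-Category": {"value": 3.0}},
]

# json_normalize turns {"Total-Hours": {"value": 3.0}} into a column "Total-Hours.value".
flat = pd.json_normalize(records)

# Hypothetical shorter names, chosen here only for readability.
flat = flat.rename(columns={
    "Total-Hours.value": "total_hours",
    "Calculated-Category.value": "category",
})

# Overall mean of the flattened numeric column.
mean_hours = flat["total_hours"].mean()

# Number of records that share each total_hours value.
counts = flat.groupby("total_hours")["key"].count()

print(mean_hours)
print(counts)
```

Because every nested dict is flattened at once, this avoids repeating the extraction for each column, at the cost of the dotted column names that need renaming.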
