简体   繁体   中英

Pandas Create a column with the a sum of a nested dataframe column

How can I add a new column to a dataframe with the sum of values from a nested dataframe, without losing any of the other columns and nested data using pandas?

Specifically, I want to create a new column total_cost with the sum of all nested dataframes for a row.

I managed to create the following dataframe using a series of groupby and apply :

  user_id   description                                       unit_summary
0     111  xxx  [{'total_period_cost': 100, 'unit_id': 'xxx', ...
1     222  xxx  [{'total_period_cost': 100, 'unit_id': 'yyy', ...

I'm trying to add the column total_cost which is the sum of total_period_cost of each nested dataframe (grouped by user_id ). How can I achieve the following dataframe?

  user_id   description   total_cost                          unit_summary
0     111  xxx  300  [{'total_period_cost': 100, 'unit_id': 'xxx', ...
1     222  xxx  100  [{'total_period_cost': 100, 'unit_id': 'yyy', ...

My code:

import pandas as pd

series = [{
    "user_id":"111", 
    "description": "xxx",
    "unit_summary":[
        {
        "total_period_cost":100,
        "unit_id":"xxx",
        "cost_per_unit":50,
        "total_period_usage":2
        },
        {
        "total_period_cost":200,
        "unit_id":"yyy",
        "cost_per_unit":25,
        "total_period_usage": 8
        }
    ]
},
{
    "user_id":"222",
    "description": "xxx",
    "unit_summary":[
        {
            "total_period_cost":100,
            "unit_id":"yyy",
            "cost_per_unit":25,
            "total_period_usage": 4
        }
    ]
}]

df = pd.DataFrame(series)

print(df)
print(df.to_dict(orient='records'))

Here is an example of the groupby..apply code I used to achieve the series JSON object:

import pandas as pd

series = [
    {"user_id":"111", "unit_id":"xxx","cost_per_unit":50, "total_period_usage": 1},
    {"user_id":"111", "unit_id":"xxx","cost_per_unit":50, "total_period_usage": 1},
    {"user_id":"111", "unit_id":"yyy","cost_per_unit":25, "total_period_usage": 8},
    {"user_id":"222", "unit_id":"yyy","cost_per_unit":25, "total_period_usage": 3},
    {"user_id":"222", "unit_id":"yyy","cost_per_unit":25, "total_period_usage": 1}
]

df = pd.DataFrame(series)

sumc = (
    df.groupby(['user_id', 'unit_id', 'cost_per_unit'], as_index=False)
        .agg({'total_period_usage': 'sum'})
)

sumc['total_period_cost'] = sumc.total_period_usage * sumc.cost_per_unit

sumc = (
    sumc.groupby(['user_id'])
        .apply(lambda x: x[['total_period_cost', 'unit_id', 'cost_per_unit', 'total_period_usage']].to_dict('r'))
        .reset_index()
)

sumc = sumc.rename(columns={0:'unit_summary'})

sumc['description'] = 'xxx'

print(sumc)
print(sumc.to_dict(orient='records'))

Solved it by adding the following from the answer by anky_91:

def myf(x):
    return pd.DataFrame(x).loc[:,'total_period_cost'].sum()
# Sum all server sumbscriptions total_period_cost
sumc['total_period_cost'] = sumc['unit_summary'].apply(myf)

You can read each row in the unit_summary column as a dataframe and sum the desired column:

Method1: apply

def myf(x):
    return pd.DataFrame(x).loc[:,'total_period_cost'].sum()
df['total_cost'] = df['unit_summary'].apply(myf)

print(df)

Method2: similarly via list comprehension:

df['total_cost'] = [pd.DataFrame(i)['total_period_cost'].sum() 
                            for i in df['unit_summary'].tolist()]

Method3: using explode :

m = df['unit_summary'].explode()
df['total_cost'] = pd.DataFrame(m.tolist(),index=m.index)['total_period_cost'].sum(level=0)

  user_id description                                       unit_summary  \
0     111         xxx  [{'total_period_cost': 100, 'unit_id': 'xxx', ...   
1     222         xxx  [{'total_period_cost': 100, 'unit_id': 'yyy', ...   

   total_cost  
0         300  
1         100 

In addition to the above, starting from your original dataframe, we can also do something like below to achieve the desired output, however this wouldnot give you the series with the dicts ('unit_summary`):

(df.assign(total_cost=df['cost_per_unit']*df['total_period_usage'])
  .groupby(['user_id'],as_index=False)['total_cost'].sum().assign(description='xxxx'))

  user_id  total_cost description
0     111         300        xxxx
1     222         100        xxxx

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM