
Get summary data columns in new pandas dataframe from existing dataframe based on other column-ID

I want to summarize the data in a dataframe and add the new columns to another dataframe. My data contains apartments with an ID number, and it has surface and volume values for each room in the apartment. What I want is a dataframe that summarizes this and gives me the total surface and volume per apartment. There are two conditions for the original dataframe:

- the dataframe can contain empty cells
- when the surface or volume values are equal for all of the rows within an ID (so all the same values for the same ID), the data (surface, volume) is not summed; instead a single value/row is passed to the new summary column (example: 'ID 4'), as this could be a mistake in the original dataframe where the government employee inserted the total surface/volume for every room

Initial dataframe 'data':

print(data)

    ID  Surface  Volume
0    2     10.0    25.0
1    2     12.0    30.0
2    2     24.0    60.0
3    2      8.0    20.0
4    4     84.0   200.0
5    4     84.0   200.0
6    4     84.0   200.0
7   52      NaN     NaN
8   52     96.0   240.0
9   95      8.0    20.0
10  95      6.0    15.0
11  95     12.0    30.0
12  95     30.0    75.0
13  95     12.0    30.0

Desired output from 'df':

print(df)
    ID  Surface  Volume
0    2     54.0   135.0
1    4     84.0   200.0  #-> values are identical for every row of this ID in the original data, so one row is passed instead of the sum (see the second condition)
2   52     96.0   240.0
3   95     68.0   170.0

Tried code:

import pandas as pd
import numpy as np

df = pd.DataFrame({"ID": [2, 4, 52, 95]})

data = pd.DataFrame({"ID":      [2, 2, 2, 2, 4, 4, 4, 52, 52, 95, 95, 95, 95, 95],
                     "Surface": [10, 12, 24, 8, 84, 84, 84, np.nan, 96, 8, 6, 12, 30, 12],
                     "Volume":  [25, 30, 60, 20, 200, 200, 200, np.nan, 240, 20, 15, 30, 75, 30]})
print(data)

# Tried something, but no idea how to do this actually:
df["Surface"] = data.groupby("ID").agg(sum)
df["Volume"] = data.groupby("ID").agg(sum)
print(df)

Two conditions are necessary here. First, test for unique values per group in each column separately with GroupBy.transform and DataFrameGroupBy.nunique, comparing for equality with 1 via eq. Second, use DataFrame.duplicated on each column joined with the ID column.

Chain both masks with & for bitwise AND, replace the matched values with NaN using DataFrame.mask, and finally aggregate with sum:

cols = ['Surface','Volume']
m1 = data.groupby("ID")[cols].transform('nunique').eq(1)
m2 = data[cols].apply(lambda x: x.to_frame().join(data['ID']).duplicated())

df = data[cols].mask(m1 & m2).groupby(data["ID"]).sum().reset_index()
print(df)
   ID  Surface  Volume
0   2     54.0   135.0
1   4     84.0   200.0
2  52     96.0   240.0
3  95     68.0   170.0
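To see how the two conditions interact, it can help to print the combined mask before it is applied (a small illustration using the same data; the variable names follow the answer's code):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"ID":      [2, 2, 2, 2, 4, 4, 4, 52, 52, 95, 95, 95, 95, 95],
                     "Surface": [10, 12, 24, 8, 84, 84, 84, np.nan, 96, 8, 6, 12, 30, 12],
                     "Volume":  [25, 30, 60, 20, 200, 200, 200, np.nan, 240, 20, 15, 30, 75, 30]})

cols = ['Surface', 'Volume']
# m1: True where every (non-NaN) value in the group is identical
m1 = data.groupby("ID")[cols].transform('nunique').eq(1)
# m2: True where a (value, ID) pair repeats an earlier row
m2 = data[cols].apply(lambda x: x.to_frame().join(data['ID']).duplicated())

# Only rows 5 and 6 (the repeated 84/200 rows of ID 4) satisfy both
# conditions, so only those rows are replaced by NaN before summing.
# Note that ID 95 has a duplicated 12/30 row (m2 True) but its values are
# not all equal (m1 False), so it is correctly still summed.
print(m1 & m2)
```

This is why plain drop_duplicates would not work here: it would also drop the legitimate repeated 12/30 room of ID 95.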

If you need new columns filled with the aggregated sum values, use GroupBy.transform:

cols = ['Surface','Volume']
m1 = data.groupby("ID")[cols].transform('nunique').eq(1)
m2 = data[cols].apply(lambda x: x.to_frame().join(data['ID']).duplicated())

data[cols] = data[cols].mask(m1 & m2).groupby(data["ID"]).transform('sum')
print(data)
    ID  Surface  Volume
0    2     54.0   135.0
1    2     54.0   135.0
2    2     54.0   135.0
3    2     54.0   135.0
4    4     84.0   200.0
5    4     84.0   200.0
6    4     84.0   200.0
7   52     96.0   240.0
8   52     96.0   240.0
9   95     68.0   170.0
10  95     68.0   170.0
11  95     68.0   170.0
12  95     68.0   170.0
13  95     68.0   170.0
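An alternative, not from the original answer but a sketch of the same logic, is a custom aggregation function that decides per group whether to sum or to pass a single value through. The helper name `sum_or_first` is hypothetical, and it assumes each group has at least one non-NaN value:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"ID":      [2, 2, 2, 2, 4, 4, 4, 52, 52, 95, 95, 95, 95, 95],
                     "Surface": [10, 12, 24, 8, 84, 84, 84, np.nan, 96, 8, 6, 12, 30, 12],
                     "Volume":  [25, 30, 60, 20, 200, 200, 200, np.nan, 240, 20, 15, 30, 75, 30]})

def sum_or_first(s):
    # If all non-NaN values in the group are identical, pass one value
    # through instead of summing (the second condition); otherwise sum,
    # skipping NaNs. Assumes the group has at least one non-NaN value.
    if s.nunique(dropna=True) == 1:
        return s.dropna().iloc[0]
    return s.sum()

df = data.groupby('ID')[['Surface', 'Volume']].agg(sum_or_first).reset_index()
print(df)
```

This produces the same summary dataframe, at the cost of a slower Python-level aggregation compared with the vectorized mask approach above.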
