
Get summary data columns in new pandas dataframe from existing dataframe based on other column-ID

I want to summarize the data in a dataframe and add the new columns to another dataframe. My data contains apartments with an ID number, and it has surface and volume values for each room in the apartment. What I want is a dataframe that summarizes this and gives me the total surface and volume per apartment. There are two conditions for the original dataframe:

- the dataframe can contain empty cells
- when the surface or volume values are equal for all of the rows within an ID (so all the same values for the same ID), the data (surface, volume) is not summed; instead a single value/row is passed to the new summary column (example: 'ID 4'), as this could be a mistake in the original dataframe where the government employee inserted the total surface/volume for every room

Initial dataframe 'data':

print(data)

    ID  Surface  Volume
0    2     10.0    25.0
1    2     12.0    30.0
2    2     24.0    60.0
3    2      8.0    20.0
4    4     84.0   200.0
5    4     84.0   200.0
6    4     84.0   200.0
7   52      NaN     NaN
8   52     96.0   240.0
9   95      8.0    20.0
10  95      6.0    15.0
11  95     12.0    30.0
12  95     30.0    75.0
13  95     12.0    30.0

Desired output from 'df':

print(df)
    ID  Surface  Volume
0    2     54.0   135.0
1    4     84.0   200.0  #-> values are identical for every row of this ID in the original data, so one row is passed instead of the sum (see the second condition)
2   52     96.0   240.0
3   95     68.0   170.0

Tried code:

import pandas as pd
import numpy as np

df = pd.DataFrame({"ID": [2, 4, 52, 95]})

data = pd.DataFrame({"ID":      [2, 2, 2, 2, 4, 4, 4, 52, 52, 95, 95, 95, 95, 95],
                     "Surface": [10, 12, 24, 8, 84, 84, 84, np.nan, 96, 8, 6, 12, 30, 12],
                     "Volume":  [25, 30, 60, 20, 200, 200, 200, np.nan, 240, 20, 15, 30, 75, 30]})
print(data)

# Tried something, but no idea how to do this actually:
df["Surface"] = data.groupby("ID").agg(sum)
df["Volume"] = data.groupby("ID").agg(sum)
print(df)

Two conditions are necessary here. First, test for unique values per group in each column separately with GroupBy.transform and DataFrameGroupBy.nunique, comparing for equality with 1 via eq. Second, use DataFrame.duplicated on each column joined with the ID column.

Chain both masks with & for bitwise AND, replace the matched values with NaN using DataFrame.mask, and finally aggregate with sum:

cols = ['Surface','Volume']
m1 = data.groupby("ID")[cols].transform('nunique').eq(1)
m2 = data[cols].apply(lambda x: x.to_frame().join(data['ID']).duplicated())

df = data[cols].mask(m1 & m2).groupby(data["ID"]).sum().reset_index()
print(df)
   ID  Surface  Volume
0   2     54.0   135.0
1   4     84.0   200.0
2  52     96.0   240.0
3  95     68.0   170.0
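To see how the two conditions interact, it can help to print the combined mask before it is applied (a small illustration using the same data; the variable names follow the answer's code):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"ID":      [2, 2, 2, 2, 4, 4, 4, 52, 52, 95, 95, 95, 95, 95],
                     "Surface": [10, 12, 24, 8, 84, 84, 84, np.nan, 96, 8, 6, 12, 30, 12],
                     "Volume":  [25, 30, 60, 20, 200, 200, 200, np.nan, 240, 20, 15, 30, 75, 30]})

cols = ['Surface', 'Volume']
# m1: True where every (non-NaN) value in the group is identical
m1 = data.groupby("ID")[cols].transform('nunique').eq(1)
# m2: True where a (value, ID) pair repeats an earlier row
m2 = data[cols].apply(lambda x: x.to_frame().join(data['ID']).duplicated())

# Only rows 5 and 6 (the repeated 84/200 rows of ID 4) satisfy both
# conditions, so only those rows are replaced by NaN before summing.
# Note that ID 95 has a duplicated 12/30 row (m2 True) but its values are
# not all equal (m1 False), so it is correctly still summed.
print(m1 & m2)
```

This is why plain drop_duplicates would not work here: it would also drop the legitimate repeated 12/30 room of ID 95.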

If you need new columns filled with the aggregated sum values, use GroupBy.transform:

cols = ['Surface','Volume']
m1 = data.groupby("ID")[cols].transform('nunique').eq(1)
m2 = data[cols].apply(lambda x: x.to_frame().join(data['ID']).duplicated())

data[cols] = data[cols].mask(m1 & m2).groupby(data["ID"]).transform('sum')
print(data)
    ID  Surface  Volume
0    2     54.0   135.0
1    2     54.0   135.0
2    2     54.0   135.0
3    2     54.0   135.0
4    4     84.0   200.0
5    4     84.0   200.0
6    4     84.0   200.0
7   52     96.0   240.0
8   52     96.0   240.0
9   95     68.0   170.0
10  95     68.0   170.0
11  95     68.0   170.0
12  95     68.0   170.0
13  95     68.0   170.0
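An alternative, not from the original answer but a sketch of the same logic, is a custom aggregation function that decides per group whether to sum or to pass a single value through. The helper name `sum_or_first` is hypothetical, and it assumes each group has at least one non-NaN value:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"ID":      [2, 2, 2, 2, 4, 4, 4, 52, 52, 95, 95, 95, 95, 95],
                     "Surface": [10, 12, 24, 8, 84, 84, 84, np.nan, 96, 8, 6, 12, 30, 12],
                     "Volume":  [25, 30, 60, 20, 200, 200, 200, np.nan, 240, 20, 15, 30, 75, 30]})

def sum_or_first(s):
    # If all non-NaN values in the group are identical, pass one value
    # through instead of summing (the second condition); otherwise sum,
    # skipping NaNs. Assumes the group has at least one non-NaN value.
    if s.nunique(dropna=True) == 1:
        return s.dropna().iloc[0]
    return s.sum()

df = data.groupby('ID')[['Surface', 'Volume']].agg(sum_or_first).reset_index()
print(df)
```

This produces the same summary dataframe, at the cost of a slower Python-level aggregation compared with the vectorized mask approach above.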
