
Python Pandas - groupby conditional on another dataframe

I have two dataframes that are identical in shape (same date index as rows, same firms as columns). I want to calculate time-series statistics for the observations in Dataframe1 based on the grouping contained in Dataframe2. For example, I want to calculate, for each date, the average observation (Dataframe1) per rank (Dataframe2).

So it is essentially a groupby, except that the grouping condition comes from a second dataframe.

Glad for any input as I was not able to find a similar problem!

Dataframe1
----------------------------------
            A      B      C      D      E      F       G      H             
31.12.2009  30     66     NaN    NaN    NaN    NaN     393    57     
01.01.2010  30     66     NaN    NaN    NaN    NaN     393    57   
04.01.2010  31     66     NaN    NaN    NaN    NaN     404    57     
05.01.2010  33     66     NaN    NaN    NaN    NaN     400    58    
06.01.2010  33     66     NaN    NaN    NaN    NaN     400    58   


Dataframe2
----------------------------------
            A      B      C      D      E      F       G      H            
31.12.2009  1.0    2.0    NaN    NaN    NaN    NaN     2.0    1.0     
01.01.2010  1.0    2.0    NaN    NaN    NaN    NaN     2.0    1.0   
04.01.2010  1.0    1.0    NaN    NaN    NaN    NaN     2.0    2.0     
05.01.2010  1.0    2.0    NaN    NaN    NaN    NaN     1.0    2.0    
06.01.2010  2.0    2.0    NaN    NaN    NaN    NaN     1.0    1.0  


Desired output
----------------------------------
            1.0     2.0            
31.12.2009  43.5    229.5     
01.01.2010  43.5    229.5   
04.01.2010  48.5    230.5       
05.01.2010  216.5   62.0        
06.01.2010  229.0   49.5     
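For anyone who wants to reproduce the example, the two tables above can be rebuilt roughly as follows. This is only a sketch: the names df1 and df2 follow the answers below, and keeping the dates as plain strings is an assumption (the real index may well be a DatetimeIndex).

```python
import numpy as np
import pandas as pd

nan = np.nan
dates = ["31.12.2009", "01.01.2010", "04.01.2010", "05.01.2010", "06.01.2010"]
firms = list("ABCDEFGH")

# Dataframe1: the observations
df1 = pd.DataFrame(
    [[30, 66, nan, nan, nan, nan, 393, 57],
     [30, 66, nan, nan, nan, nan, 393, 57],
     [31, 66, nan, nan, nan, nan, 404, 57],
     [33, 66, nan, nan, nan, nan, 400, 58],
     [33, 66, nan, nan, nan, nan, 400, 58]],
    index=dates, columns=firms)

# Dataframe2: the rank of each firm on each date
df2 = pd.DataFrame(
    [[1, 2, nan, nan, nan, nan, 2, 1],
     [1, 2, nan, nan, nan, nan, 2, 1],
     [1, 1, nan, nan, nan, nan, 2, 2],
     [1, 2, nan, nan, nan, nan, 1, 2],
     [2, 2, nan, nan, nan, nan, 1, 1]],
    index=dates, columns=firms, dtype=float)
```

If real dates are needed, pd.to_datetime(dates, format="%d.%m.%Y") should parse these strings.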

You can build the result dataframe with a dictionary comprehension. For each unique value in df2, where replaces the values in df1 with NaN wherever df2 does not equal that value, so that mean over axis=1 averages only the matching observations in each row (mean skips NaN by default).

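# Assumes pandas is imported as pd; dropna() keeps NaN out of the rank labels
ranks = df2.stack().dropna().unique()
df_res = pd.DataFrame({col: df1.where(df2.eq(col)).mean(axis=1) for col in ranks})
print(df_res)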
              1.0    2.0
31.12.2009   43.5  229.5
01.01.2010   43.5  229.5
04.01.2010   48.5  230.5
05.01.2010  216.5   62.0
06.01.2010  229.0   49.5
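To make the masking step more concrete, here is a small sketch on two dates and four firms taken from the question's tables (the names masked and rank1_mean are only illustrative). It shows what where leaves behind for rank 1 before the row mean is taken.

```python
import pandas as pd

# Two dates and four firms copied from the question's tables
idx = ["01.01.2010", "04.01.2010"]
df1 = pd.DataFrame({"A": [30, 31], "B": [66, 66], "G": [393, 404], "H": [57, 57]}, index=idx)
df2 = pd.DataFrame({"A": [1.0, 1.0], "B": [2.0, 1.0], "G": [2.0, 2.0], "H": [1.0, 2.0]}, index=idx)

# Keep df1 values where the rank equals 1, set everything else to NaN
masked = df1.where(df2.eq(1.0))

# Row-wise mean; NaN entries are skipped by default (skipna=True)
rank1_mean = masked.mean(axis=1)

print(masked)
print(rank1_mean)  # 43.5 for 01.01.2010, 48.5 for 04.01.2010
```

On 01.01.2010 only A and H have rank 1, so the mean is (30 + 57) / 2 = 43.5; on 04.01.2010 it is A and B, giving (31 + 66) / 2 = 48.5.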

Doing each value one at a time:

(1)

df1.where(df2 == 1).mean(axis=1)

Output:

31.12.2009     43.5
01.01.2010     43.5
04.01.2010     48.5
05.01.2010    216.5
06.01.2010    229.0

(2)

df1.where(df2 == 2).mean(axis=1)

Output:

31.12.2009    229.5
01.01.2010    229.5
04.01.2010    230.5
05.01.2010     62.0
06.01.2010     49.5

Combining both into your desired output:

output = pd.DataFrame({'1':df1.where(df2 == 1).mean(axis=1),
                       '2':df1.where(df2 == 2).mean(axis=1)})
                1      2
31.12.2009   43.5  229.5
01.01.2010   43.5  229.5
04.01.2010   48.5  230.5
05.01.2010  216.5   62.0
06.01.2010  229.0   49.5
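If there are many ranks, or if statistics other than the mean are needed, one possible alternative (not taken from the answers above) is to reshape both frames to long format and use an actual groupby. The sketch below uses two dates from the question; the names tidy, stats and mean_by_rank are hypothetical, and parsing the dates with pd.to_datetime is an assumption made so that groupby sorts them chronologically rather than as strings.

```python
import pandas as pd

# Two dates and four firms copied from the question's tables
idx = pd.to_datetime(["05.01.2010", "06.01.2010"], format="%d.%m.%Y")
df1 = pd.DataFrame({"A": [33, 33], "B": [66, 66], "G": [400, 400], "H": [58, 58]}, index=idx)
df2 = pd.DataFrame({"A": [1.0, 2.0], "B": [2.0, 2.0], "G": [1.0, 1.0], "H": [2.0, 1.0]}, index=idx)

# One row per (date, firm) pair, with the observation and its rank side by side
tidy = (
    pd.DataFrame({"value": df1.stack(), "rank": df2.stack()})
    .dropna()                       # drop pairs without an observation or a rank
    .rename_axis(["date", "firm"])
    .reset_index()
)

# Several statistics per date and rank in one pass
stats = tidy.groupby(["date", "rank"])["value"].agg(["mean", "median", "count"])

# Same layout as the desired output: dates as rows, ranks as columns
mean_by_rank = stats["mean"].unstack("rank")
print(mean_by_rank)
```

For these two dates, mean_by_rank should agree with the last two rows of the output above. The where-based version is probably easier to read for two ranks; the long format mainly pays off once several aggregations are wanted at the same time.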

