
Generate a new dataframe with a boolean column by comparing values from other dataframes

I have 3 dataframes, each with shape (8004, 29), that share the following schema (example values shown):

        id      var0    var1    var2    var3    var4 ...  var29
        5171    10.0    2.8     0.0     5.0     1.0  ...  9.4  
        5171    40.9    2.5     3.4     4.5     1.3  ...  7.7  
        5171    60.7    3.1     5.2     6.6     3.4  ...  1.0
        ...
        5171    0.5     1.3     5.1     0.5     0.2  ...  0.4
        4567    1.5     2.0     1.0     4.5     0.1  ...  0.4  
        4567    4.4     2.0     1.3     6.4     0.1  ...  3.3  
        4567    6.3     3.0     1.5     7.6     1.6  ...  1.6
        ...
        4567    0.7     1.4     1.4     0.3     4.2  ...  1.7
       ... 
        9584    0.3     2.6     0.0     5.2     1.6  ...  9.7  
        9584    0.5     1.2     8.3     3.4     1.3  ...  1.7  
        9584    0.7     3.0     5.6     6.6     3.0  ...  1.0
        ...
        9584    0.7     1.3     0.1     0.0     2.0  ...  1.7

Each id has 58 rows, and there are 138 unique ids.

I am only interested in the last column of these dataframes, var29. What I need to do is the following comparison:

if df1['var29'] > (df2['var29'] + df3['var29']) or
   df1['var29'] < (df2['var29'] - df3['var29']) 

and generate a new dataframe as a result:

        id      result 
        5171    True   
        5171    True   
        5171    False   
        ...
        5171    False    
        4567    True    
        4567    True    
        4567    True    
        ...
        4567    False    
       ... 
        9584    True   
        9584    False    
        9584    False    
        ...
        9584    True    

I tried to loop over each index and use a lambda to generate the result dataframe as follows, but it failed:

idxs = unique(df1.index).tolist()
results = pd.DataFrame(index=df1.index)
for idx in idxs:
    results['result'] = df1.loc[idx]['var29'].apply(lambda x: True if (
                (df2['var29'].loc[idx] - df3['var29'].loc[idx]) > x or (
                    df2['var29'].loc[idx] + df3['var29'].loc[idx]) < x) else False)

Can someone help me to generate it?

Here's one way to do it: bring the var29 values from df2 and df3 into a copy of df1, matched on id, so that the values being compared belong to the same id:

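# copy df1 and look up var29 from df2 and df3 by id
# (the lookup assumes each id appears only once in df2 and df3, as in the sample data)
new_df = df1.copy()
new_df['df2'] = new_df['id'].map(df2.set_index('id')['var29'])
new_df['df3'] = new_df['id'].map(df3.set_index('id')['var29'])

# True where df1's var29 lies above df2 + df3 or below df2 - df3
cond = (new_df['var29'] > (new_df['df2'] + new_df['df3'])) | (new_df['var29'] < (new_df['df2'] - new_df['df3']))
new_df['result'] = cond  # cond is already a boolean Series

# keep only the id and result columns
new_df = new_df[['id', 'result']]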
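
When ids repeat, as in the question (58 rows per id), an id alone does not identify a row. One possible variation, sketched here under the assumption that the k-th row of an id in df1 corresponds to the k-th row of that id in df2 and df3, is to merge on the id plus the row's position within the id. The helper with_pos, the pos column, and the small made-up frames are illustrative names and data, not part of the original post.

```python
import numpy as np
import pandas as pd

# Hypothetical small frames: 3 ids with 4 rows each, standing in for 138 ids x 58 rows
rng = np.random.default_rng(0)
ids = np.repeat([5171, 4567, 9584], 4)
df1 = pd.DataFrame({"id": ids, "var29": rng.normal(size=ids.size)})
df2 = pd.DataFrame({"id": ids, "var29": rng.normal(size=ids.size)})
df3 = pd.DataFrame({"id": ids, "var29": rng.normal(size=ids.size)})

def with_pos(df, name):
    """Keep id and var29 (renamed to `name`) and add the row's position within its id."""
    out = df[["id", "var29"]].rename(columns={"var29": name})
    return out.assign(pos=out.groupby("id").cumcount())

# Left merges keep df1's row order; (id, pos) pairs up the matching rows
merged = (
    with_pos(df1, "v1")
    .merge(with_pos(df2, "v2"), on=["id", "pos"], how="left")
    .merge(with_pos(df3, "v3"), on=["id", "pos"], how="left")
)

# True where df1's value lies outside the band df2 - df3 .. df2 + df3
outside = (merged["v1"] > merged["v2"] + merged["v3"]) | (
    merged["v1"] < merged["v2"] - merged["v3"]
)
paired = merged[["id"]].assign(result=outside)
print(paired)
```

If a row in df1 has no partner in df2 or df3, the merged values are NaN and the comparison yields False for that row, so it may be worth checking for missing values before trusting the result.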

Sample Data

import numpy as np
import pandas as pd

# ten unique ids with random var29 values in each dataframe
df1 = pd.DataFrame({'id': list(range(10)), 'var29': np.random.randn(10)})
df2 = pd.DataFrame({'id': list(range(10)), 'var29': np.random.randn(10)})
df3 = pd.DataFrame({'id': list(range(10)), 'var29': np.random.randn(10)})
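
If the three dataframes are already in the same row order, which the question's layout suggests but does not state, a simpler sketch is to compare the columns position by position with no lookup at all. The frames below are made-up data with repeated ids, and the names x, centre, and width are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical frames with repeated ids, all in the same row order
rng = np.random.default_rng(1)
ids = np.repeat([5171, 4567, 9584], 4)
df1 = pd.DataFrame({"id": ids, "var29": rng.normal(size=ids.size)})
df2 = pd.DataFrame({"id": ids, "var29": rng.normal(size=ids.size)})
df3 = pd.DataFrame({"id": ids, "var29": rng.normal(size=ids.size)})

# Guard the assumption: every row must carry the same id in all three frames
same_order = np.array_equal(df1["id"].to_numpy(), df2["id"].to_numpy()) and np.array_equal(
    df1["id"].to_numpy(), df3["id"].to_numpy()
)
if not same_order:
    raise ValueError("row order differs between the dataframes")

# Work on plain arrays so the comparison is positional and ignores index labels
x = df1["var29"].to_numpy()
centre = df2["var29"].to_numpy()
width = df3["var29"].to_numpy()

# True where df1's value is above df2 + df3 or below df2 - df3
outside_band = (x > centre + width) | (x < centre - width)
result = pd.DataFrame({"id": df1["id"].to_numpy(), "result": outside_band})
print(result)
```

This avoids both the Python-level loop and the per-row lambda from the question, since the whole column is compared in one vectorized expression.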
