Generate a new dataframe with boolean column based on comparing other dataframes values

I have the following 3 dataframes, each of shape (8004, 29), with this schema as an example:

        id      var0    var1    var2    var3    var4 ...  var29
        5171    10.0    2.8     0.0     5.0     1.0  ...  9.4  
        5171    40.9    2.5     3.4     4.5     1.3  ...  7.7  
        5171    60.7    3.1     5.2     6.6     3.4  ...  1.0
        ...
        5171    0.5     1.3     5.1     0.5     0.2  ...  0.4
        4567    1.5     2.0     1.0     4.5     0.1  ...  0.4  
        4567    4.4     2.0     1.3     6.4     0.1  ...  3.3  
        4567    6.3     3.0     1.5     7.6     1.6  ...  1.6
        ...
        4567    0.7     1.4     1.4     0.3     4.2  ...  1.7
       ... 
        9584    0.3     2.6     0.0     5.2     1.6  ...  9.7  
        9584    0.5     1.2     8.3     3.4     1.3  ...  1.7  
        9584    0.7     3.0     5.6     6.6     3.0  ...  1.0
        ...
        9584    0.7     1.3     0.1     0.0     2.0  ...  1.7

Each id has 58 elements (rows), and there are 138 unique ids.

I am only interested in the last column of this data, var29. What I need to do is the following comparison:

if df1['var29'] > (df2['var29'] + df3['var29']) or
   df1['var29'] < (df2['var29'] - df3['var29']) 

and, as a result, generate a new dataframe:

        id      result 
        5171    True   
        5171    True   
        5171    False   
        ...
        5171    False    
        4567    True    
        4567    True    
        4567    True    
        ...
        4567    False    
       ... 
        9584    True   
        9584    False    
        9584    False    
        ...
        9584    True    

I tried iterating over each index and using a lambda to generate the result dataframe as follows, but it failed:

idxs = unique(df1.index).tolist()
results = pd.DataFrame(index=df1.index)
for idx in idxs:
    results['result'] = df1.loc[idx]['var29'].apply(lambda x: True if (
                (df2['var29'].loc[idx] - df3['var29'].loc[idx]) > x or (
                    df2['var29'].loc[idx] + df3['var29'].loc[idx]) < x) else False)

Can someone help me generate it?

Here is one approach: map the columns into a single dataframe so that the ids are matched up correctly:

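# work on a copy so df1 is left untouched
new_df = df1.copy()

# look up var29 from df2 and df3 by id (map needs unique ids in df2 and df3)
new_df['df2'] = new_df['id'].map(df2.set_index('id')['var29'])
new_df['df3'] = new_df['id'].map(df3.set_index('id')['var29'])

# True where var29 lies outside the band df2 +/- df3
cond = (new_df['var29'] > (new_df['df2'] + new_df['df3'])) | (new_df['var29'] < (new_df['df2'] - new_df['df3']))
# cond is already a boolean Series, so it can be assigned directly
new_df['result'] = cond

# keep only the id and result columns
new_df = new_df[['id', 'result']]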

Sample data

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'id': list(range(10)), 'var29': np.random.randn(10)})
df2 = pd.DataFrame({'id': list(range(10)), 'var29': np.random.randn(10)})
df3 = pd.DataFrame({'id': list(range(10)), 'var29': np.random.randn(10)})
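
The mapping above relies on each id appearing only once in df2 and df3, as it does in the sample data. In the question each id repeats 58 times, and Series.map raises an error when the lookup index contains duplicates. For that layout, a minimal sketch is to compare the columns by position. This assumes the three dataframes list the same observations in the same row order, which the question suggests but does not confirm. The function name compare_var29 and the small example values are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd


def compare_var29(df1, df2, df3, col="var29"):
    """Flag rows where df1[col] falls outside df2[col] +/- df3[col].

    Assumption: the three frames have the same number of rows in the same
    order, so row i of each frame describes the same observation.
    """
    if not (len(df1) == len(df2) == len(df3)):
        raise ValueError("the three dataframes must have the same number of rows")

    # Compare raw arrays so pandas does not try to align on a repeated index.
    a = df1[col].to_numpy()
    b = df2[col].to_numpy()
    c = df3[col].to_numpy()
    outside = (a > b + c) | (a < b - c)

    # Assumption: `id` is an ordinary column; if it is an index named `id`,
    # call reset_index() on df1 before passing it in.
    return pd.DataFrame({"id": df1["id"].to_numpy(), "result": outside})


# Tiny hypothetical example with repeated ids (values made up for illustration).
df1 = pd.DataFrame({"id": [5171, 5171, 4567, 4567], "var29": [9.4, 7.7, 0.4, 3.3]})
df2 = pd.DataFrame({"id": [5171, 5171, 4567, 4567], "var29": [5.0, 7.0, 0.5, 1.0]})
df3 = pd.DataFrame({"id": [5171, 5171, 4567, 4567], "var29": [1.0, 1.0, 0.5, 1.0]})

result = compare_var29(df1, df2, df3)
print(result)
```

Working on NumPy arrays sidesteps index alignment entirely, so the check is only as reliable as the row-order assumption. If the row order could differ between the frames, sort them on a shared key first.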

