Create a new Pandas DataFrame from observations that meet specific criteria

Question

I have two original dataframes. One contains the limits, df_limits:

           feat_1   feat_2  feat_3  
target       12       9       90    
UL           15       10      120   
LL           9        8       60

where target is the ideal value, UL is the upper limit, and LL is the lower limit.

The other contains the original data, df_to_check:

ID          feat_1  feat_2  feat_3  
123         12.5    9.6     100 
456         18      3       100
789         9       11      100

I'm writing a function whose desired output is the ID and the features that are below or above the thresholds (the limits from the first dataframe). So far I can recognise which features are out of limits, but I'm getting the full original dataframe as output...

def table(df_limits, df_to_check, column):
    
    UL = df_limits[column].loc['target'] + df_limits[column].loc['UL']
    LL = df_limits[column].loc['target'] + df_limits[column].loc['LL']
    
    UL_index = df_to_check.loc[df_to_check[column] > UL].index
    LL_index = df_to_check.loc[df_to_check[column] < LL].index
    
    if UL_index is not None:
        above_limit = {'ID': df_to_check['ID'],
                       'column': df_to_check[column],
                       'target':  df_limits[column].loc['target']}
        
    return pd.DataFrame(above_limit)

What should I change so that the output shows only the ID and the column where observations are out of limits? Ideally it would also show by what percentage the original value deviates from the ideal value target (I would be glad for advice on how to add such a column):

ID     column    target    value    deviate(%)
456    feat_1    12        18       50
456    feat_2    9         3        ...
789    feat_2    9         11       ...

Right now the function returns the whole dataset, because the statement checks whether the index is not None, and it never is. I understand why I have this issue, but I don't know how to change it.

The issue is with the statement if UL_index is not None: since it lets the whole dataset through, I'm looking for a way to replace this part.
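For context, a minimal sketch of why that check never fails: the .index of a filtered frame is always an Index object, possibly empty, and never None. The small frame below is made-up illustration data, not the data from the question.

```python
import pandas as pd

# Made-up illustration data (not from the question).
df = pd.DataFrame({"ID": [123, 456], "feat_1": [12.5, 18.0]})

# No row exceeds 100, so the filtered index has no entries...
idx = df.loc[df["feat_1"] > 100].index

# ...but it is still an Index object, so `is not None` is always True.
always_true = idx is not None

# `.empty` (or len(idx) == 0) is what actually tells you nothing matched.
is_empty = idx.empty
```

In most cases the emptiness check is not needed either: filtering with the boolean condition already returns only the matching rows, and an empty result is a valid (empty) dataframe.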

Answer 1

First of all, you have not provided a reproducible example (https://stackoverflow.com/help/minimal-reproducible-example), because you have not shared the code that produces the two initial dataframes. Please keep that in mind next time you ask a question. Without it, I made a toy example with my own (random) data.

I start by unpivoting what you call df_to_check: that's because, if you want to check each feature independently, that dataframe is not normalised (you might want to look up what database normalisation means).

The next step is a left outer join between the unpivoted dataframe you want to check and the (transposed) dataframe with the limits.

Once you have that, you can easily calculate whether a row is within the range, the deviation between value and target, etc., and you can of course group the result however you want.

My code is below. It should be easy enough to customise it to your case.

import numpy as np
import pandas as pd

# Toy limits: one column per feature, one row per limit type
df_limits = pd.DataFrame(index=['min val', 'max val', 'target'])
df_limits['a'] = [2, 4, 3]
df_limits['b'] = [3, 5, 4.5]

# Random data to check: 100 rows, values in [0, 6)
df = pd.DataFrame(columns=df_limits.columns, data=np.random.rand(100, 2) * 6)

# Unpivot to long format: one row per (id, feature)
df_unpiv = pd.melt(
    df.reset_index().rename(columns={'index': 'id'}),
    id_vars='id', var_name='feature', value_name='value',
)

# I reset the index because I couldn't get a join on a column and index, but there is probably a better way to do it
df_joined = pd.merge(
    df_unpiv,
    df_limits.transpose().reset_index().rename(columns={'index': 'feature'}),
    how='left', on='feature',
)
df_joined['abs diff from target'] = abs(df_joined['value'] - df_joined['target'])

df_joined['outside range'] = (
    (df_joined['value'] < df_joined['min val'])
    | (df_joined['value'] > df_joined['max val'])
)

df_outside_range = df_joined.query("`outside range` == True")
df_inside_range = df_joined.query("`outside range` == False")
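On the code comment about joining on a column and an index: one possible way, sketched here, is to pass left_on together with right_index=True to merge, which skips the reset_index/rename step. The limits are the same toy values as above; the long-format values are hand-picked for illustration rather than random.

```python
import pandas as pd

# Same toy limits as in the answer above.
df_limits = pd.DataFrame(
    {"a": [2, 4, 3], "b": [3, 5, 4.5]},
    index=["min val", "max val", "target"],
)

# Hand-picked long-format values (hypothetical, for illustration only).
df_unpiv = pd.DataFrame(
    {
        "id": [0, 1, 0, 1],
        "feature": ["a", "a", "b", "b"],
        "value": [1.0, 3.5, 4.0, 5.5],
    }
)

# left_on names a column of the left frame; right_index=True matches it
# against the index of the transposed limits (the feature names).
df_joined = df_unpiv.merge(
    df_limits.T, how="left", left_on="feature", right_index=True
)

df_joined["outside range"] = (
    (df_joined["value"] < df_joined["min val"])
    | (df_joined["value"] > df_joined["max val"])
)
```

Since the column is already boolean, df_joined[df_joined["outside range"]] is an alternative to the query call for selecting the out-of-range rows.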

Answer 2

Approach

reshape
merge
calculate

new_df = (df_to_check.set_index("ID").unstack().reset_index()
 .rename(columns={"level_0":"column",0:"value"})
 .merge(df_limits.T, left_on="column", right_index=True)
 .assign(deviate=lambda dfa: (dfa.value-dfa.target)/dfa.target)
)

column   ID    value   target   UL    LL   deviate
feat_1   123   12.5    12       15    9    0.0416667
feat_1   456   18      12       15    9    0.5
feat_1   789   9       12       15    9    -0.25
feat_2   123   9.6     9        10    8    0.0666667
feat_2   456   3       9        10    8    -0.666667
feat_2   789   11      9        10    8    0.222222
feat_3   123   100     90       120   60   0.111111
feat_3   456   100     90       120   60   0.111111
feat_3   789   100     90       120   60   0.111111
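The table above still contains every observation. To get closer to the layout requested in the question, one option is to filter on UL and LL after the merge and scale the deviation to a percentage. This is only a sketch: the data is transcribed from the question, and the strict comparisons (a value equal to a limit counts as inside), the rounding to one decimal, and the name out_of_limits are assumptions.

```python
import pandas as pd

# Sample data transcribed from the question.
df_limits = pd.DataFrame(
    {"feat_1": [12, 15, 9], "feat_2": [9, 10, 8], "feat_3": [90, 120, 60]},
    index=["target", "UL", "LL"],
)
df_to_check = pd.DataFrame(
    {
        "ID": [123, 456, 789],
        "feat_1": [12.5, 18, 9],
        "feat_2": [9.6, 3, 11],
        "feat_3": [100, 100, 100],
    }
)

# Reshape, merge and calculate, as in the answer above.
new_df = (
    df_to_check.set_index("ID").unstack().reset_index()
    .rename(columns={"level_0": "column", 0: "value"})
    .merge(df_limits.T, left_on="column", right_index=True)
    .assign(deviate=lambda dfa: (dfa.value - dfa.target) / dfa.target)
)

# Assumption: limits are inclusive, so only values strictly above UL
# or strictly below LL are reported.
outside = (new_df["value"] > new_df["UL"]) | (new_df["value"] < new_df["LL"])

out_of_limits = (
    new_df.loc[outside]
    .assign(**{"deviate(%)": lambda d: (d["deviate"] * 100).round(1)})
    .sort_values(["ID", "column"])
    .loc[:, ["ID", "column", "target", "value", "deviate(%)"]]
    .reset_index(drop=True)
)
```

With the sample data this keeps the three rows listed in the question (456/feat_1, 456/feat_2, 789/feat_2). The percentage is signed, so values below the target come out negative; wrap it in abs() if only the magnitude matters.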

Answer 3

I solved my issue, maybe in a slightly clumsy way, but it works as desired. If someone has a better answer, I would still appreciate it.

Example of how to get only the observations above the upper limit; to get both, just concatenate the observations from UL_index and LL_index:

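import pandas as pd

def table(df_limits, df_to_check, column):
    # NOTE: the limits are added to the target here, i.e. UL/LL are treated as
    # offsets. If they are absolute values (as in the sample df_limits above),
    # use df_limits[column].loc['UL'] and df_limits[column].loc['LL'] directly.
    UL = df_limits[column].loc['target'] + df_limits[column].loc['UL']
    LL = df_limits[column].loc['target'] + df_limits[column].loc['LL']

    # Row labels of the observations above / below the limits
    UL_index = df_to_check.loc[df_to_check[column] > UL].index
    LL_index = df_to_check.loc[df_to_check[column] < LL].index

    df_to_check_UL = df_to_check.loc[UL_index]
    df_to_check_LL = df_to_check.loc[LL_index]  # not used in this above-limit example

    above_limit = {
        'ID': df_to_check_UL['ID'],
        'feature value': df_to_check_UL[column],
        'target': df_limits[column].loc['target'],
    }

    # Build the result only for the rows above the upper limit
    df_above_limit = pd.DataFrame(above_limit, index=df_to_check_UL.index)

    return df_above_limit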
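A sketch of the "concatenate both" variant mentioned above. The function name out_of_limits and the output column names are hypothetical, and UL and LL are assumed to be absolute values, as in the sample df_limits from the question, rather than offsets from the target.

```python
import pandas as pd

def out_of_limits(df_limits, df_to_check, column):
    """Return the rows whose `column` value is above UL or below LL."""
    # Assumption: UL and LL are absolute limits, not offsets from target.
    target = df_limits.loc["target", column]
    ul = df_limits.loc["UL", column]
    ll = df_limits.loc["LL", column]

    above = df_to_check.loc[df_to_check[column] > ul]
    below = df_to_check.loc[df_to_check[column] < ll]

    # Stack both selections; an empty selection contributes no rows.
    hits = pd.concat([above, below])

    # Scalars (column name, target) are broadcast over the selected rows.
    return pd.DataFrame(
        {
            "ID": hits["ID"],
            "column": column,
            "target": target,
            "value": hits[column],
        }
    )

# Sample data transcribed from the question.
df_limits = pd.DataFrame(
    {"feat_1": [12, 15, 9], "feat_2": [9, 10, 8], "feat_3": [90, 120, 60]},
    index=["target", "UL", "LL"],
)
df_to_check = pd.DataFrame(
    {
        "ID": [123, 456, 789],
        "feat_1": [12.5, 18, 9],
        "feat_2": [9.6, 3, 11],
        "feat_3": [100, 100, 100],
    }
)

result = out_of_limits(df_limits, df_to_check, "feat_2")
```

Calling the function once per feature and passing the results to pd.concat would give a single table across all columns.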

3 answers

Answer 1: score 1, 2021-02-08 11:16:26
Answer 2: score 1, accepted, 2021-02-08 11:26:31
Answer 3: score -1, 2021-02-08 11:16:12
