Create a new pandas DataFrame from observations that meet specific criteria
I have two original dataframes. One contains limits: df_limits
feat_1 feat_2 feat_3
target 12 9 90
UL 15 10 120
LL 9 8 60
where target is the ideal value, UL is the upper limit, and LL is the lower limit.
The other one contains the original data: df_to_check
ID feat_1 feat_2 feat_3
123 12.5 9.6 100
456 18 3 100
789 9 11 100
I'm writing a function whose desired output is the ID and the features that are below or above the threshold (the limits from the first DataFrame). So far I can recognise which features are out of limits, but I'm getting the full original DataFrame as output...
def table(df_limits, df_to_check, column):
    UL = df_limits[column].loc['target'] + df_limits[column].loc['UL']
    LL = df_limits[column].loc['target'] + df_limits[column].loc['LL']
    UL_index = df_to_check.loc[df_to_check[column] > UL].index
    LL_index = df_to_check.loc[df_to_check[column] < LL].index
    if UL_index is not None:
        above_limit = {'ID': df_to_check['ID'],
                       'column': df_to_check[column],
                       'target': df_limits[column].loc['target']}
        return pd.DataFrame(above_limit)
What should I change so that my desired output looks like the table below (showing only the ID and the column where observations are out of limits)?
Ideally it would also show by how many percent the original value deviates from the ideal value target (I would be glad for advice on how to add such a column).
ID column target value deviate(%)
456 feat_1 12 18 50
456 feat_2 9 3 ...
789 feat_2 9 11 ...
Right now, after running this function, it returns the whole dataset, because the statement checks whether the index is not None, and it is not None. I understand why I have this issue, but I don't know how to change it.
The issue is with the statement if UL_index is not None: since it makes the function return the whole dataset. I'm looking for a way to replace this part.
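To illustrate the problem described above, here is a minimal sketch showing that the result of `.index` is an Index object even when no row matches, so comparing it with None cannot detect an empty selection. The limit value 1000 and the name `upper_limit` are made up for the illustration, and 15 is assumed to be an absolute upper limit for feat_1, as in the df_limits table.

```python
import pandas as pd

df_to_check = pd.DataFrame({
    "ID": [123, 456, 789],
    "feat_1": [12.5, 18, 9],
})

# No row exceeds 1000, yet .index still returns an (empty) Index object.
empty_index = df_to_check.loc[df_to_check["feat_1"] > 1000].index
print(empty_index is not None)  # the check from the question
print(empty_index.empty)        # a check that actually detects "no rows"

# Filtering with the boolean mask keeps only the offending rows,
# so no `if` is needed at all.
upper_limit = 15  # assumption: absolute limit taken from df_limits
above = df_to_check.loc[df_to_check["feat_1"] > upper_limit, ["ID", "feat_1"]]
print(above)
```

Note also that the dictionary in the question is built from the unfiltered df_to_check, so even a correct condition would still return every row.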
First of all, you have not provided a reproducible example (https://stackoverflow.com/help/minimal-reproducible-example), because you have not shared the code which produces the two initial dataframes. Next time you ask a question, please keep it in mind. Without those, I made a toy example with my own (random) data.
I start by unpivoting what you call df_to_check: that's because, if you want to check each feature independently, then that dataframe is not normalised (you might want to look up what database normalisation means).
The next step is a left outer join between the unpivoted dataframe you want to check and the (transposed) dataframe with the limits.
Once you have that, you can easily calculate whether a row is within the range, the deviation between value and target, etc., and you can of course group this however you want.
My code is below. It should be easy enough to customise it to your case.
import pandas as pd
import numpy as np
df_limits = pd.DataFrame(index =['min val','max val','target'])
df_limits['a']=[2,4,3]
df_limits['b']=[3,5,4.5]
df =pd.DataFrame(columns = df_limits.columns, data =np.random.rand(100,2)*6 )
df_unpiv = pd.melt( df.reset_index().rename(columns ={'index':'id'}), id_vars='id', var_name ='feature', value_name = 'value' )
# I reset the index because I couldn't get a join on a column and index, but there is probably a better way to do it
df_joined = pd.merge( df_unpiv, df_limits.transpose().reset_index().rename(columns = {'index':'feature'}), how='left', on ='feature' )
df_joined['abs diff from target'] = abs( df_joined['value'] - df_joined['target'] )
df_joined['outside range'] = (df_joined['value'] < df_joined['min val'] ) | (df_joined['value'] > df_joined['max val'])
df_outside_range = df_joined.query(" `outside range` == True " )
df_inside_range = df_joined.query(" `outside range` == False " )
Approach
new_df = (df_to_check.set_index("ID").unstack().reset_index()
.rename(columns={"level_0":"column",0:"value"})
.merge(df_limits.T, left_on="column", right_index=True)
.assign(deviate=lambda dfa: (dfa.value-dfa.target)/dfa.target)
)
column | ID | value | target | UL | LL | deviate |
---|---|---|---|---|---|---|
feat_1 | 123 | 12.5 | 12 | 15 | 9 | 0.0416667 |
feat_1 | 456 | 18 | 12 | 15 | 9 | 0.5 |
feat_1 | 789 | 9 | 12 | 15 | 9 | -0.25 |
feat_2 | 123 | 9.6 | 9 | 10 | 8 | 0.0666667 |
feat_2 | 456 | 3 | 9 | 10 | 8 | -0.666667 |
feat_2 | 789 | 11 | 9 | 10 | 8 | 0.222222 |
feat_3 | 123 | 100 | 90 | 120 | 60 | 0.111111 |
feat_3 | 456 | 100 | 90 | 120 | 60 | 0.111111 |
feat_3 | 789 | 100 | 90 | 120 | 60 | 0.111111 |
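The table above still lists every observation, while the question asks only for the out-of-limit ones with the deviation in percent. A possible way to get there is sketched below. It uses melt instead of unstack for the reshaping step, treats UL and LL as absolute limits (an assumption based on the sample data), and the names `long_df`, `merged`, `out_of_limits` and `result` are mine, not from the original posts.

```python
import pandas as pd

# Sample data rebuilt from the tables in the question.
df_limits = pd.DataFrame(
    {"feat_1": [12, 15, 9], "feat_2": [9, 10, 8], "feat_3": [90, 120, 60]},
    index=["target", "UL", "LL"],
)
df_to_check = pd.DataFrame({
    "ID": [123, 456, 789],
    "feat_1": [12.5, 18, 9],
    "feat_2": [9.6, 3, 11],
    "feat_3": [100, 100, 100],
})

# One row per (ID, feature) pair.
long_df = df_to_check.melt(id_vars="ID", var_name="column", value_name="value")

# Attach target, UL and LL to each row by feature name.
merged = long_df.merge(df_limits.T, left_on="column", right_index=True)

# Keep only rows outside [LL, UL] (assumed to be absolute limits).
outside = (merged["value"] > merged["UL"]) | (merged["value"] < merged["LL"])
out_of_limits = merged.loc[outside].copy()

# Signed deviation from the target, in percent.
out_of_limits["deviate(%)"] = (
    (out_of_limits["value"] - out_of_limits["target"])
    / out_of_limits["target"] * 100
).round(1)

result = out_of_limits[["ID", "column", "target", "value", "deviate(%)"]]
result = result.reset_index(drop=True)
print(result)
```

On the sample data this should leave three rows: ID 456 for feat_1 and feat_2, and ID 789 for feat_2. If the limits in the real data are offsets from the target rather than absolute values, the mask would need `target + UL` and `target + LL` instead.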
I solved my issue, maybe in a bit of a clumsy way, but it works as desired. If someone has a better answer, I would still appreciate it:
This example shows how to get only the observations above the limits; to get both, just concatenate the observations from UL_index and LL_index.
def table(df_limits, df_to_check, column):
    above_limit = []
    df_above_limit = pd.DataFrame()
    UL = df_limits[column].loc['target'] + df_limits[column].loc['UL']
    LL = df_limits[column].loc['target'] + df_limits[column].loc['LL']
    UL_index = df_to_check.loc[df_to_check[column] > UL].index
    LL_index = df_to_check.loc[df_to_check[column] < LL].index
    df_to_check_UL = df_to_check.loc[UL_index]
    df_to_check_LL = df_to_check.loc[LL_index]
    above_limit = {
        'ID': df_to_check_UL['ID'],
        'feature value': df_to_check[column],
        'target': df_limits[column].loc['target']
    }
    df_above_limit = pd.DataFrame(above_limit, index = df_to_check_UL.index)
    return df_above_limit
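The concatenation mentioned above could look roughly like the sketch below, which stays close to the per-column structure of the function. It is not the code from the post: the function name `out_of_limits` and the variable `report` are hypothetical, and UL and LL are treated as absolute limits (the function above adds them to target, which only fits data where the limits are offsets).

```python
import pandas as pd

def out_of_limits(df_limits, df_to_check, column):
    """Return the rows whose `column` value is above UL or below LL."""
    target = df_limits.loc["target", column]
    upper = df_limits.loc["UL", column]  # assumption: absolute limit
    lower = df_limits.loc["LL", column]  # assumption: absolute limit

    above = df_to_check.loc[df_to_check[column] > upper, ["ID", column]]
    below = df_to_check.loc[df_to_check[column] < lower, ["ID", column]]

    # Stack both selections and restore the original row order.
    found = pd.concat([above, below]).sort_index()
    found = found.rename(columns={column: "value"})
    found = found.assign(column=column, target=target)
    return found[["ID", "column", "target", "value"]]

# Sample data rebuilt from the tables in the question.
df_limits = pd.DataFrame(
    {"feat_1": [12, 15, 9], "feat_2": [9, 10, 8], "feat_3": [90, 120, 60]},
    index=["target", "UL", "LL"],
)
df_to_check = pd.DataFrame({
    "ID": [123, 456, 789],
    "feat_1": [12.5, 18, 9],
    "feat_2": [9.6, 3, 11],
    "feat_3": [100, 100, 100],
})

# Run the check for every feature and stack the per-column results.
report = pd.concat(
    [out_of_limits(df_limits, df_to_check, col) for col in df_limits.columns],
    ignore_index=True,
)
print(report)
```

Because each selection is taken with a boolean mask, a column with no violations simply contributes an empty frame, and no `if` on the index is required.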