
Create new Pandas DataFrame from observations which meet specific criteria

I have two original dataframes. One contains the limits, df_limits:

           feat_1   feat_2  feat_3  
target       12       9       90    
UL           15       10      120   
LL           9        8       60

where target is the ideal value, UL is the upper limit and LL is the lower limit.

The other one holds the original data, df_to_check:

ID          feat_1  feat_2  feat_3  
123         12.5    9.6     100 
456         18      3       100
789         9       11      100

I'm creating a function whose desired output is the ID and the features that are below or above the threshold (the limits from the first DataFrame). So far I'm able to recognise which features are out of limits, but I'm getting the full original DataFrame as output:

def table(df_limits, df_to_check, column):
    
    UL = df_limits[column].loc['target'] + df_limits[column].loc['UL']
    LL = df_limits[column].loc['target'] + df_limits[column].loc['LL']
    
    UL_index = df_to_check.loc[df_to_check[column] > UL].index
    LL_index = df_to_check.loc[df_to_check[column] < LL].index
    
    if UL_index is not None:
        above_limit = {'ID': df_to_check['ID'],
                       'column': df_to_check[column],
                       'target':  df_limits[column].loc['target']}
        
    return pd.DataFrame(above_limit)

What should I change so that I get the desired output below (showing only the ID and the column where observations are out of limits)?

It would be best if it also showed by how many percent the original value deviates from the ideal value target (I would be glad for advice on how to add such a column):

ID     column    target    value    deviate(%)
456    feat_1    12        18       50
456    feat_2    9         3        ...
789    feat_2    9         11       ...

Now, after running this function, it returns the whole dataset, because the statement checks whether the index is not None, and it never is None. I understand why I have this issue, but I don't know how to change it.

The issue is with the statement if UL_index is not None: since it returns the whole dataset, and I'm looking for a way to replace this part.
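As a minimal sketch of why that condition never filters anything: boolean indexing returns an Index object even when no row matches, so comparing it with None is always true. The small frame below reuses the feat_1 values from the question; the thresholds 1000 and 15 are only illustrative assumptions.

```python
import pandas as pd

df_to_check = pd.DataFrame({"ID": [123, 456, 789], "feat_1": [12.5, 18, 9]})

# Boolean indexing always yields an Index object, even when nothing matches
empty_index = df_to_check.loc[df_to_check["feat_1"] > 1000].index
print(empty_index is not None)  # the Index object exists, so this is never False
print(empty_index.empty)        # .empty is the attribute that reports "no rows"

# Selecting rows with the mask itself makes the if-statement unnecessary
mask = df_to_check["feat_1"] > 15  # 15 is an assumed upper limit
above = df_to_check.loc[mask, ["ID", "feat_1"]]
print(above)
```

Selecting with the mask returns only the matching rows, and an empty frame when there are none.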

First of all, you have not provided a reproducible example (https://stackoverflow.com/help/minimal-reproducible-example), because you have not shared the code which produces the two initial dataframes. Next time you ask a question, please keep that in mind. Without those, I made a toy example with my own (random) data.

I start by unpivoting what you call df_to_check: that's because, if you want to check each feature independently, then that dataframe is not normalised (you might want to look up what database normalisation means).

The next step is a left outer join between the unpivoted dataframe you want to check and the (transposed) dataframe with the limits.

Once you have that, you can easily calculate whether a row is within the range, the deviation between value and target, etc., and you can of course group this however you want.

My code is below. It should be easy enough to customise it to your case.

import pandas as pd
import numpy as np


df_limits = pd.DataFrame(index =['min val','max val','target'])

df_limits['a']=[2,4,3]
df_limits['b']=[3,5,4.5]


df =pd.DataFrame(columns = df_limits.columns, data =np.random.rand(100,2)*6 )

df_unpiv = pd.melt( df.reset_index().rename(columns ={'index':'id'}), id_vars='id', var_name ='feature', value_name = 'value' )

# I reset the index because I couldn't get a join on a column and index, but there is probably a better way to do it
df_joined = pd.merge( df_unpiv, df_limits.transpose().reset_index().rename(columns = {'index':'feature'}), how='left', on ='feature' )
df_joined['abs diff from target'] = abs( df_joined['value'] - df_joined['target'] )

df_joined['outside range'] =  (df_joined['value'] < df_joined['min val'] ) | (df_joined['value'] > df_joined['max val'])
    
df_outside_range = df_joined.query(" `outside range` == True "  )
df_inside_range = df_joined.query(" `outside range` == False "  )
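The comment in the code above mentions not getting a join between a column and an index to work. One possible way, sketched here on a tiny hand-written stand-in for df_unpiv (the four values are made up for illustration), is to pass left_on together with right_index=True to merge, which removes the need for reset_index and rename on the limits.

```python
import pandas as pd

df_limits = pd.DataFrame(
    {"a": [2, 4, 3], "b": [3, 5, 4.5]},
    index=["min val", "max val", "target"],
)

# Hypothetical stand-in for the unpivoted frame built above
df_unpiv = pd.DataFrame(
    {
        "id": [0, 1, 0, 1],
        "feature": ["a", "a", "b", "b"],
        "value": [1.0, 3.5, 4.0, 5.5],
    }
)

# Match the 'feature' column against the index of the transposed limits
df_joined = df_unpiv.merge(
    df_limits.T, how="left", left_on="feature", right_index=True
)
print(df_joined)
```

The joined frame has the same columns as before (min val, max val, target next to each value), so the remaining steps stay unchanged.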

Approach

  • reshape
  • merge
  • calculate

new_df = (df_to_check.set_index("ID").unstack().reset_index()
 .rename(columns={"level_0":"column",0:"value"})
 .merge(df_limits.T, left_on="column", right_index=True)
 .assign(deviate=lambda dfa: (dfa.value-dfa.target)/dfa.target)
)

column  ID   value  target  UL   LL  deviate
feat_1  123  12.5   12      15   9   0.0416667
feat_1  456  18     12      15   9   0.5
feat_1  789  9      12      15   9   -0.25
feat_2  123  9.6    9       10   8   0.0666667
feat_2  456  3      9       10   8   -0.666667
feat_2  789  11     9       10   8   0.222222
feat_3  123  100    90      120  60  0.111111
feat_3  456  100    90      120  60  0.111111
feat_3  789  100    90      120  60  0.111111
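The table above still contains every observation. A possible follow-up, sketched below, keeps only the rows outside the limits and expresses the deviation in percent. It assumes that UL and LL are absolute limits (not offsets from target), which is consistent with the desired output in the question, and it rebuilds both dataframes from the values shown there. The names out_of_limits and deviate(%) are chosen here for illustration.

```python
import pandas as pd

df_limits = pd.DataFrame(
    {"feat_1": [12, 15, 9], "feat_2": [9, 10, 8], "feat_3": [90, 120, 60]},
    index=["target", "UL", "LL"],
)
df_to_check = pd.DataFrame(
    {
        "ID": [123, 456, 789],
        "feat_1": [12.5, 18, 9],
        "feat_2": [9.6, 3, 11],
        "feat_3": [100, 100, 100],
    }
)

# Same reshape / merge / calculate chain as above
new_df = (
    df_to_check.set_index("ID").unstack().reset_index()
    .rename(columns={"level_0": "column", 0: "value"})
    .merge(df_limits.T, left_on="column", right_index=True)
    .assign(deviate=lambda dfa: (dfa.value - dfa.target) / dfa.target)
)

# Assumption: UL and LL are absolute limits, so a row is out of limits
# when its value is strictly below LL or strictly above UL
out_of_limits = (
    new_df.query("value < LL or value > UL")
    .assign(**{"deviate(%)": lambda dfa: (dfa["deviate"] * 100).round(1)})
    .sort_values(["ID", "column"])
    .loc[:, ["ID", "column", "target", "value", "deviate(%)"]]
    .reset_index(drop=True)
)
print(out_of_limits)
```

With the sample data this leaves three rows: ID 456 for feat_1 and feat_2, and ID 789 for feat_2. A value exactly on a limit, such as 9 for ID 789 in feat_1, counts as inside.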

I solved my issue, maybe in a bit of a clumsy way, but it works as desired. If someone has a better answer, I will still appreciate it:

Example of how to get only the observations above the limit; to get both, just concatenate the observations from UL_index and LL_index:

def table(df_limits,df_to_check,column):
    
    above_limit = []
    df_above_limit = pd.DataFrame()
    
    UL = df_limits[column].loc['target'] + df_limits[column].loc['UL']
    LL = df_limits[column].loc['target'] + df_limits[column].loc['LL']
    
    UL_index = df_to_check.loc[df_to_check[column] > UL].index
    LL_index = df_to_check.loc[df_to_check[column] < LL].index
    
    df_to_check_UL = df_to_check.loc[UL_index]
    df_to_check_LL = df_to_check.loc[LL_index]
    
    above_limit = {
                    'ID': df_to_check_UL['ID'],
                    'feature value': df_to_check[column],
                    'target':  df_limits[column].loc['target']
    }
        
    df_above_limit = pd.DataFrame(above_limit, index = df_to_check_UL.index)
        
    return df_above_limit
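The concatenation mentioned above could look roughly like the sketch below. It keeps the threshold arithmetic of the function above (target plus UL or LL), so the sample limits here are hypothetical offsets from the target rather than the values from the question; the name table_both and the extra ID 321 are also made up for illustration.

```python
import pandas as pd


def table_both(df_limits, df_to_check, column):
    target = df_limits.loc["target", column]
    # Thresholds computed as in the function above: target plus offset
    UL = target + df_limits.loc["UL", column]
    LL = target + df_limits.loc["LL", column]

    above = df_to_check.loc[df_to_check[column] > UL]
    below = df_to_check.loc[df_to_check[column] < LL]
    out = pd.concat([above, below])  # rows above the limit first, then below

    return pd.DataFrame(
        {"ID": out["ID"], "feature value": out[column], "target": target}
    )


# Hypothetical limits, expressed as offsets from the target
df_limits = pd.DataFrame({"feat_1": [12, 3, -3]}, index=["target", "UL", "LL"])
df_to_check = pd.DataFrame(
    {"ID": [123, 456, 789, 321], "feat_1": [12.5, 18, 9, 8]}
)

result = table_both(df_limits, df_to_check, "feat_1")
print(result)
```

Building the result from the already filtered rows means every column shares the same index, so no explicit index argument is needed.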
