Pandas pivot_table: taking the most recent value only if it represents a certain percentage of the values present

I am trying to find values for certain IDs and codes in a massive data set by taking the most recently used value for each unique (ID, Code) pair. At the moment I simply take the most recent value for each pair using the code below:

data.head()
    ID      Code    value
15  13513   X2784   30.0
16  12665   X2744   65.0
17  16543   X2744   65.0
19  15761   X2100   29.0
21  14265   X2750   48.0

df = data.pivot_table(index='ID', columns='Code', values='value', aggfunc = 'first')

df.head()
ID      X2784   X2744   X2100   X2750
13271   30.0    65.0    29.0    35.0
16343   30.0    65.0    29.0    35.0
19342   30.0    65.0    29.0    35.0
15437   30.0    65.0    29.0    35.0
14359   30.0    65.0    29.0    48.0

The issue is that some of these values are wrong due to anomalies in the data. The idea would be to look at the most recent value, check whether it accounts for at least a certain percentage of all values for that pair, and only then assign it. An example of the issue looks like this:

data[(data['ID'] == '14359') & (data['Code'] == 'X2750')]['value'].value_counts()
35.0     2530
48.0        2

The value of 48.0 is the most recent occurrence, but it accounts for such a small share of the values (2 out of 2532) that it should be considered an anomaly. Is there any way to combine the pivot_table aggfunc "first" with some sort of occurrence threshold?
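To make the "certain percentage" check concrete, here is a minimal sketch of how the share of the most recent value could be measured for one pair. The series below is made-up illustration data (not the real data set), and it assumes the rows are ordered most recent first, as the use of "first" implies.

```python
import pandas as pd

# Hypothetical values for a single (ID, Code) pair, most recent first.
values = pd.Series([48.0, 35.0, 35.0, 35.0, 48.0, 35.0, 35.0, 35.0, 35.0, 35.0])

most_recent = values.iloc[0]

# normalize=True turns the counts into fractions of all non-null values.
shares = values.value_counts(normalize=True)
most_recent_share = shares[most_recent]

print(most_recent, most_recent_share)
```

Comparing `most_recent_share` against a chosen cutoff is then enough to decide whether the most recent value can be trusted.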

If you are sure that the majority value is always the one you want, you could use the median aggregation to get the "middle" (50% quantile) value. As long as the anomalies make up less than half of the values for a pair, the median ignores them. Note that this no longer looks at the most recent value at all.

Try this:

df = data.pivot_table(index='ID', columns='Code', values='value', aggfunc='median')
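As a rough illustration of the difference between the two aggregations, the sketch below builds a small made-up frame (the IDs and codes are borrowed from the question, the rows themselves are invented) in which the first row for one pair is an anomaly.

```python
import pandas as pd

# Hypothetical sample: the first (most recent) row for ID 14359 is an anomaly.
data = pd.DataFrame({
    "ID": ["14359"] * 5 + ["13271"] * 3,
    "Code": ["X2750"] * 8,
    "value": [48.0, 35.0, 35.0, 35.0, 35.0, 35.0, 35.0, 35.0],
})

# "first" picks up the anomalous 48.0, "median" returns the majority value.
first = data.pivot_table(index="ID", columns="Code", values="value", aggfunc="first")
median = data.pivot_table(index="ID", columns="Code", values="value", aggfunc="median")

print(first)
print(median)
```

One caveat: with an even number of values split evenly between two candidates, the median averages the two middle values and can return a number that never occurs in the data.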

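I was able to figure it out using a lambda function for the aggfunc. It keeps the first value if that value makes up more than 25% of the values for the pair, and otherwise falls back to the most frequent value (the mode):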

aggfunc = lambda x: x.iloc[0] if x.value_counts()[x.iloc[0]]/x.value_counts().sum() > .25 else x.mode(dropna = False).iat[0]
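For readability, the same logic could be written as a named function, roughly as sketched below. The name `recent_or_mode`, the `threshold` parameter, and the sample rows are all hypothetical and only serve the illustration; the rows are assumed to be ordered most recent first.

```python
import pandas as pd


def recent_or_mode(x: pd.Series, threshold: float = 0.25):
    """Return the first value if it is frequent enough, else the most frequent value."""
    most_recent = x.iloc[0]
    # Fraction of non-null values equal to the most recent one.
    share = x.value_counts(normalize=True).get(most_recent, 0.0)
    if share > threshold:
        return most_recent
    return x.mode(dropna=False).iat[0]


# Hypothetical sample data, most recent row first within each (ID, Code) pair.
data = pd.DataFrame({
    "ID": ["14359"] * 10 + ["15761"] * 3,
    "Code": ["X2750"] * 10 + ["X2100"] * 3,
    "value": [48.0] + [35.0] * 9 + [29.0, 30.0, 30.0],
})

df = data.pivot_table(index="ID", columns="Code", values="value", aggfunc=recent_or_mode)
print(df)
```

Here 48.0 covers only 10% of its pair and is replaced by the mode 35.0, while 29.0 covers a third of its pair and is kept even though it is not the most frequent value.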

Thanks everyone for the help!
