I am trying to find values for certain ID/Code pairs in a massive data set by taking the most recently used value for each unique pair. Currently I just take the most recent value for each code using the code below:
data.head()
ID Code value
15 13513 X2784 30.0
16 12665 X2744 65.0
17 16543 X2744 65.0
19 15761 X2100 29.0
21 14265 X2750 48.0
df = data.pivot_table(index='ID', columns='Code', values='value', aggfunc='first')
df.head()
ID X2784 X2744 X2100 X2750
13271 30.0 65.0 29.0 35.0
16343 30.0 65.0 29.0 35.0
19342 30.0 65.0 29.0 35.0
15437 30.0 65.0 29.0 35.0
14359 30.0 65.0 29.0 48.0
The issue is that some of these values are wrong due to anomalies in the data. The idea would be to look at the most recent value, check whether it accounts for at least a certain percentage of all values for that pair, and only assign it if it does. An example of the issue:
data[(data['ID'] == 14359) & (data['Code'] == 'X2750')]['value'].value_counts()
35.0 2530
48.0 2
The value of 48.0 is the most recent occurrence, but it happens such a small percentage of the time that it should be considered an anomaly. Is there any way to combine the pivot_table aggfunc "first" with some sort of threshold on the number of occurrences?
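To make the problem reproducible, here is a small hypothetical frame (the IDs, codes, and counts are illustrative, not the real data) showing how aggfunc='first' picks up the anomaly:

```python
import pandas as pd

# Hypothetical rows for one ID/Code pair, sorted newest-first as in the
# question, so aggfunc='first' returns the most recent value per pair.
data = pd.DataFrame({
    'ID':    [14359] * 10,
    'Code':  ['X2750'] * 10,
    'value': [48.0] + [35.0] * 9,  # the newest value, 48.0, is the anomaly
})

# 'first' blindly takes the newest value for the pair ...
first = data.pivot_table(index='ID', columns='Code',
                         values='value', aggfunc='first')
print(first)

# ... even though that value is only 10% of the observations for this pair.
share = data['value'].value_counts(normalize=True)
print(share)
```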
If you are sure that the majority value is always the one you want, you could use median aggregation to get the "middle" (50th percentile) value. This would cut off anomalies on either side.
Try this:
import numpy as np
df = data.pivot_table(index='ID', columns='Code', values='value', aggfunc=np.median)
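A minimal sketch of this median approach on made-up data (values are illustrative): because the anomalous 48.0 is rare, the median lands on the majority value.

```python
import numpy as np
import pandas as pd

# Illustrative data: one rare anomaly (48.0) among many 35.0 observations.
data = pd.DataFrame({
    'ID':    [14359] * 5,
    'Code':  ['X2750'] * 5,
    'value': [48.0, 35.0, 35.0, 35.0, 35.0],
})

# The median ignores the rare outlier and returns the majority value.
df = data.pivot_table(index='ID', columns='Code', values='value',
                      aggfunc=np.median)
print(df)
```

Note that the median only works here because the values are numeric and the majority value dominates; it would not return the most recent value when the distribution is more evenly split.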
I was able to figure it out using a lambda function as the aggfunc: take the most recent value if it makes up more than 25% of the group, otherwise fall back to the mode.
aggfunc = lambda x: x.iloc[0] if x.value_counts()[x.iloc[0]] / x.value_counts().sum() > 0.25 else x.mode(dropna=False).iat[0]
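For readability, the same threshold logic can be written as a named function (a sketch on made-up data; the function name and the 25% cutoff are illustrative choices):

```python
import pandas as pd

def first_if_common(x, threshold=0.25):
    """Return the group's most recent value if it accounts for more than
    `threshold` of the observations; otherwise fall back to the mode."""
    counts = x.value_counts(dropna=False)
    newest = x.iloc[0]  # data is assumed sorted newest-first, as in the question
    if counts[newest] / counts.sum() > threshold:
        return newest
    return x.mode(dropna=False).iat[0]

# Illustrative data: the newest value (48.0) is a rare anomaly.
data = pd.DataFrame({
    'ID':    [14359] * 5,
    'Code':  ['X2750'] * 5,
    'value': [48.0, 35.0, 35.0, 35.0, 35.0],
})

# 48.0 is only 20% of the group, so the mode (35.0) is used instead.
df = data.pivot_table(index='ID', columns='Code', values='value',
                      aggfunc=first_if_common)
print(df)
```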
Thanks everyone for the help!