I am trying to find values for certain ID/Code pairs in a massive data set by taking the most recently used value for each unique pair. Currently I just take the most recent value for each code using the code below:
data.head()
ID Code value
15 13513 X2784 30.0
16 12665 X2744 65.0
17 16543 X2744 65.0
19 15761 X2100 29.0
21 14265 X2750 48.0
df = data.pivot_table(index='ID', columns='Code', values='value', aggfunc='first')
df.head()
ID X2784 X2744 X2100 X2750
13271 30.0 65.0 29.0 35.0
16343 30.0 65.0 29.0 35.0
19342 30.0 65.0 29.0 35.0
15437 30.0 65.0 29.0 35.0
14359 30.0 65.0 29.0 48.0
The issue is that some of these values are wrong due to anomalies in the data. The idea would be to look at the most recent value, check whether it accounts for at least a certain percentage of all values for that pair, and only assign it if it does. An example of the issue:
data[(data['ID'] == 14359) & (data['Code'] == 'X2750')]['value'].value_counts()
35.0 2530
48.0 2
The value of 48.0 is the most recent occurrence, but it happens such a small percentage of the time that it should be considered an anomaly. Is there any way to combine the pivot_table aggfunc "first" with some sort of threshold on the number of occurrences?
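To make the problem reproducible, here is a small hypothetical frame (the IDs, codes, and counts are illustrative, not the real data) showing how aggfunc='first' picks up the anomaly:

```python
import pandas as pd

# Hypothetical rows for one ID/Code pair, sorted newest-first as in the
# question, so aggfunc='first' returns the most recent value per pair.
data = pd.DataFrame({
    'ID':    [14359] * 10,
    'Code':  ['X2750'] * 10,
    'value': [48.0] + [35.0] * 9,  # the newest value, 48.0, is the anomaly
})

# 'first' blindly takes the newest value for the pair ...
first = data.pivot_table(index='ID', columns='Code',
                         values='value', aggfunc='first')
print(first)

# ... even though that value is only 10% of the observations for this pair.
share = data['value'].value_counts(normalize=True)
print(share)
```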
If you are sure that the majority value is always the one you want, you could use median aggregation to get the "middle" (50th percentile) value. This would cut off anomalies on either side.
Try this:
import numpy as np
df = data.pivot_table(index='ID', columns='Code', values='value', aggfunc=np.median)
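A minimal sketch of this median approach on made-up data (values are illustrative): because the anomalous 48.0 is rare, the median lands on the majority value.

```python
import numpy as np
import pandas as pd

# Illustrative data: one rare anomaly (48.0) among many 35.0 observations.
data = pd.DataFrame({
    'ID':    [14359] * 5,
    'Code':  ['X2750'] * 5,
    'value': [48.0, 35.0, 35.0, 35.0, 35.0],
})

# The median ignores the rare outlier and returns the majority value.
df = data.pivot_table(index='ID', columns='Code', values='value',
                      aggfunc=np.median)
print(df)
```

Note that the median only works here because the values are numeric and the majority value dominates; it would not return the most recent value when the distribution is more evenly split.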
I was able to figure it out using a lambda function as the aggfunc: take the most recent value if it makes up more than 25% of the group, otherwise fall back to the mode.
aggfunc = lambda x: x.iloc[0] if x.value_counts()[x.iloc[0]] / x.value_counts().sum() > 0.25 else x.mode(dropna=False).iat[0]
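For readability, the same threshold logic can be written as a named function (a sketch on made-up data; the function name and the 25% cutoff are illustrative choices):

```python
import pandas as pd

def first_if_common(x, threshold=0.25):
    """Return the group's most recent value if it accounts for more than
    `threshold` of the observations; otherwise fall back to the mode."""
    counts = x.value_counts(dropna=False)
    newest = x.iloc[0]  # data is assumed sorted newest-first, as in the question
    if counts[newest] / counts.sum() > threshold:
        return newest
    return x.mode(dropna=False).iat[0]

# Illustrative data: the newest value (48.0) is a rare anomaly.
data = pd.DataFrame({
    'ID':    [14359] * 5,
    'Code':  ['X2750'] * 5,
    'value': [48.0, 35.0, 35.0, 35.0, 35.0],
})

# 48.0 is only 20% of the group, so the mode (35.0) is used instead.
df = data.pivot_table(index='ID', columns='Code', values='value',
                      aggfunc=first_if_common)
print(df)
```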
Thanks everyone for the help!