I have a Pandas DataFrame that contains a large number of categories. Each category has features, and each feature has its own subfeatures, which are grouped into pairs. A simplified version looks like the following:
                                        0         1  ...
categories features subfeatures
cat1       feature1 subfeature1 -0.224487 -0.227524
                    subfeature2 -0.591399 -0.799228
           feature2 subfeature1  1.190110 -1.365895  ...
                    subfeature2  0.720956 -1.325562
cat2       feature1 subfeature1  1.856932       NaN
                    subfeature2 -1.354258 -0.740473
           feature2 subfeature1  0.234075 -1.362235  ...
                    subfeature2  0.013875  1.309564
cat3       feature1 subfeature1       NaN       NaN
                    subfeature2 -1.260408  1.559721  ...
           feature2 subfeature1  0.419246  0.084386
                    subfeature2  0.969270  1.493417
...        ...      ...
It can be generated using the following code:
import pandas as pd
import numpy as np
np.random.seed(seed=90)
results = np.random.randn(3,2,2,2)
results[2,0,0,:] = np.nan
results[1,0,0,1] = np.nan
results = results.reshape((-1,2))
index = pd.MultiIndex.from_product([["cat1", "cat2", "cat3"],
                                    ["feature1", "feature2"],
                                    ["subfeature1", "subfeature2"]],
                                   names=["categories", "features", "subfeatures"])
df = pd.DataFrame(results, index=index)
Now I would like to retrieve the top-level categories (cat1 etc.) that have a difference between subfeature1 and subfeature2 in the same column (0 or 1) that is above a certain threshold.
For example: if the threshold is 1, then I would expect cat2 and cat3 to be returned. For feature1 in cat2, the difference between subfeature1 and subfeature2 in column 0 is 1.856932 - (-1.354258) = 3.21119 > threshold = 1. Similarly, for feature2 in cat3, the difference between subfeature1 and subfeature2 in column 1 is 1.493417 - 0.084386 = 1.409031 > 1. On the other hand, cat1 would not be returned because none of the differences between its subfeature pairs is greater than 1. A NaN value invalidates a pair, and that pair is ignored.
I have managed to implement an iterative approach, but I feel like I am not taking advantage of Pandas' full capabilities and its performance is lacking:
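threshold = 1
results = []  # matching categories (this rebinds the `results` array used above)
for cat in df.index.levels[0]:
    for feature in df.index.levels[1]:
        df2 = df.xs((cat, feature))
        diffs = abs(df2.loc['subfeature1'] - df2.loc['subfeature2'])
        # Series.max() skips NaN, unlike the built-in max()
        if diffs.max() > threshold and cat not in results:
            results.append(cat)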
yielding:
['cat2', 'cat3']
How could I go about implementing something like this using Pandas' built-in vectorized abilities?
EDIT: Using Jeff's answer below, I noticed something funky:
def f(x):
    a = max(abs(x.xs('subfeature1',level='subfeatures')-x.xs('subfeature2',level='subfeatures')))
    print(a)
    return a > 1
result = df.groupby(level=['categories','features']).filter(f)
print(result)
gives:
0.366912262765
0.571703714569
1
0.469153603312
0.0403331129905
3.2111900125 <------------------------------------------------
nan
0.220200012413
2.67179897269 <---------------------------------------------------
nan
nan
0.550023734074
1.40903094796 <-----------------------------------------------------!!!!!!!!!!!
                                        0         1
categories features subfeatures
cat2       feature1 subfeature1  1.856932       NaN
                    subfeature2 -1.354258 -0.740473
I've highlighted all the places where the algorithm should include a category based on the score. Yet it doesn't for cat3. Could the NaNs have something to do with it?
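Two general behaviours of Python's built-in max may be relevant to the printed values above. The snippet below is only a sketch of those behaviours on made-up toy values (the names nan_first, nan_last, series_max and frame are hypothetical), not a diagnosis of the output for any particular pandas version: the built-in max is order-dependent when NaN is present, and calling it on a DataFrame iterates over the column labels rather than the values.

```python
import math

import numpy as np
import pandas as pd

# Built-in max() compares with ">", and every comparison against NaN is False,
# so the result depends on where the NaN sits in the sequence.
nan_first = max([float("nan"), 2.0])   # NaN is kept because 2.0 > NaN is False
nan_last = max([2.0, float("nan")])    # 2.0 is kept because NaN > 2.0 is False

# Series.max() skips NaN by default (skipna=True).
series_max = pd.Series([np.nan, 2.0]).max()

# Iterating a DataFrame yields its column labels, so built-in max()
# returns the largest label (here 1), not the largest value (3.0).
frame = pd.DataFrame({0: [0.5], 1: [3.0]})
max_of_labels = max(frame)

print(math.isnan(nan_first), nan_last, series_max, max_of_labels)
```

Using the .max() methods of Series and DataFrame avoids both pitfalls.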
Group by the top two levels, then use filter to keep only the groups whose maximum subfeature1 - subfeature2 difference exceeds the threshold (the threshold here is 0).
In [41]: df.groupby(level=['categories','features']).filter(lambda x: (x.xs('subfeature1',level='subfeatures')-x.xs('subfeature2',level='subfeatures')).max()>0)
Out[41]:
                                        0         1
categories features subfeatures
cat1       feature1 subfeature1 -0.224487 -0.227524
                    subfeature2 -0.591399 -0.799228
           feature2 subfeature1  1.190110 -1.365895
                    subfeature2  0.720956 -1.325562
cat2       feature1 subfeature1  1.856932       NaN
                    subfeature2 -1.354258 -0.740473
           feature2 subfeature1  0.234075 -1.362235
                    subfeature2  0.013875  1.309564
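As a hedged sketch of the same groupby-and-filter idea on a recent pandas version, the filter function can reduce each group to a single boolean while skipping NaN. The helper name exceeds_threshold, the threshold of 1, the use of the absolute difference, and the hard-coded values (copied from the table in the question) are assumptions made for illustration; they are not part of the original answer.

```python
import numpy as np
import pandas as pd

# Values copied from the table in the question (assumed to match seed 90).
values = np.array([
    [-0.224487, -0.227524], [-0.591399, -0.799228],
    [1.190110, -1.365895], [0.720956, -1.325562],
    [1.856932, np.nan], [-1.354258, -0.740473],
    [0.234075, -1.362235], [0.013875, 1.309564],
    [np.nan, np.nan], [-1.260408, 1.559721],
    [0.419246, 0.084386], [0.969270, 1.493417],
])
index = pd.MultiIndex.from_product(
    [["cat1", "cat2", "cat3"],
     ["feature1", "feature2"],
     ["subfeature1", "subfeature2"]],
    names=["categories", "features", "subfeatures"],
)
df = pd.DataFrame(values, index=index)

threshold = 1.0  # assumed threshold, as in the question's example


def exceeds_threshold(group):
    """Return True if any column's |subfeature1 - subfeature2| > threshold."""
    first = group.xs("subfeature1", level="subfeatures")
    second = group.xs("subfeature2", level="subfeatures")
    # DataFrame.max() skips NaN; an all-NaN group yields NaN, and NaN > x is False.
    max_diff = (first - second).abs().max().max()
    return bool(max_diff > threshold)


# Keep only the rows of (category, feature) groups that pass the test.
kept = df.groupby(level=["categories", "features"]).filter(exceeds_threshold)
selected = kept.index.get_level_values("categories").unique().tolist()
print(selected)
```

With these values, selected contains cat2 and cat3, which matches the result of the iterative approach in the question.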
A useful debugging aid is to do something like this:
def f(x):
    print(x)
    return (x.xs(......)) # e.g. the filter from above
df.groupby(.....).filter(f)
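For larger frames, it may also be possible to avoid calling a Python function per group altogether. The following is a minimal sketch of such a vectorized variant, not the method from the answer above. It assumes the absolute difference is wanted, a threshold of 1, and the values shown in the question's table (hard-coded here so the example is self-contained); the names diffs, exceeds and per_category are hypothetical.

```python
import numpy as np
import pandas as pd

# Values copied from the table in the question (assumed to match seed 90).
values = np.array([
    [-0.224487, -0.227524], [-0.591399, -0.799228],
    [1.190110, -1.365895], [0.720956, -1.325562],
    [1.856932, np.nan], [-1.354258, -0.740473],
    [0.234075, -1.362235], [0.013875, 1.309564],
    [np.nan, np.nan], [-1.260408, 1.559721],
    [0.419246, 0.084386], [0.969270, 1.493417],
])
index = pd.MultiIndex.from_product(
    [["cat1", "cat2", "cat3"],
     ["feature1", "feature2"],
     ["subfeature1", "subfeature2"]],
    names=["categories", "features", "subfeatures"],
)
df = pd.DataFrame(values, index=index)

threshold = 1.0  # assumed threshold, as in the question's example

# Take each half of every pair; the results are indexed by (categories, features).
first = df.xs("subfeature1", level="subfeatures")
second = df.xs("subfeature2", level="subfeatures")

# Subtraction aligns on the shared index; a NaN in either half gives NaN.
diffs = (first - second).abs()

# NaN > threshold is False, so invalid pairs never count as a match.
exceeds = (diffs > threshold).any(axis=1)

# A category qualifies if any of its features has a qualifying pair.
per_category = exceeds.groupby(level="categories").any()
selected = per_category[per_category].index.tolist()
print(selected)
```

Here the only groupby is a boolean reduction over the categories level, so the arithmetic runs once over the whole frame instead of once per group.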