
Filtering top-level categories in a hierarchical pandas DataFrame using lower-level data

I have a pandas DataFrame containing a large number of categories. Each category has features, and each feature has its own subfeatures, which are grouped into pairs. A simple version looks like the following:

                                        0         1    ...
categories features subfeatures                    
cat1       feature1 subfeature1 -0.224487 -0.227524
                    subfeature2 -0.591399 -0.799228
           feature2 subfeature1  1.190110 -1.365895    ...
                    subfeature2  0.720956 -1.325562
cat2       feature1 subfeature1  1.856932       NaN
                    subfeature2 -1.354258 -0.740473
           feature2 subfeature1  0.234075 -1.362235    ...
                    subfeature2  0.013875  1.309564
cat3       feature1 subfeature1       NaN       NaN
                    subfeature2 -1.260408  1.559721    ...
           feature2 subfeature1  0.419246  0.084386
                    subfeature2  0.969270  1.493417

...                    ...               ...

It can be generated using the following code:

import pandas as pd
import numpy as np

np.random.seed(seed=90)
results = np.random.randn(3,2,2,2)
results[2,0,0,:] = np.nan
results[1,0,0,1] = np.nan
results = results.reshape((-1,2))
index = pd.MultiIndex.from_product([["cat1", "cat2", "cat3"],
                                    ["feature1", "feature2"], 
                                    ["subfeature1", "subfeature2"]], 
                                   names=["categories", "features", "subfeatures"])
df = pd.DataFrame(results, index=index)

Now I would like to retrieve the top-level categories (cat1 etc.) for which the absolute difference between subfeature1 and subfeature2 in the same column (0 or 1) is above a certain threshold.

For example, if the threshold is 1, I would expect cat2 and cat3 to be returned. For feature1 in cat2, the difference between subfeature1 and subfeature2 in column 0 is 1.856932 - (-1.354258) = 3.21119 > 1. Similarly, for feature2 in cat3, the difference between the two subfeatures in column 1 is 1.493417 - 0.084386 = 1.409031 > 1. On the other hand, cat1 would not be returned because none of the differences between its subfeature pairs is greater than 1. A NaN value invalidates the pair in that column, and that pair is ignored.

What I have tried

I have managed to implement an iterative approach, but I feel like I am not taking advantage of Pandas' full capabilities and its performance is lacking:

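threshold = 1
matches = []  # a separate list; `results` above is the NumPy array
for cat in df.index.levels[0]:
    for feature in df.index.levels[1]:
        df2 = df.xs((cat, feature))
        # absolute difference per column for this subfeature pair
        diffs = (df2.loc['subfeature1'] - df2.loc['subfeature2']).abs()
        # Series.max() skips NaN, so a NaN only invalidates its own column
        if diffs.max() > threshold and cat not in matches:
            matches.append(cat)
print(matches)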

yielding:

['cat2', 'cat3']

How could I go about implementing something like this using Pandas' built-in vectorized abilities?

EDIT: Using Jeff's answer below, I noticed something funky:

def f(x):
    a = max(abs(x.xs('subfeature1',level='subfeatures')-x.xs('subfeature2',level='subfeatures')))
    print(a)
    return a > 1

result = df.groupby(level=['categories','features']).filter(f)
print(result)

gives:

0.366912262765
0.571703714569
1
0.469153603312
0.0403331129905
3.2111900125 <------------------------------------------------
nan
0.220200012413
2.67179897269  <---------------------------------------------------
nan
nan
0.550023734074
1.40903094796  <-----------------------------------------------------!!!!!!!!!!!
                                        0         1
categories features subfeatures                    
cat2       feature1 subfeature1  1.856932       NaN
                    subfeature2 -1.354258 -0.740473

I've highlighted all the places where the algorithm should include a category based on the score. Yet it doesn't for cat3. Could the NaNs have something to do with it?

Group by the top two levels, then use filter to keep only the groups whose maximum difference between the two subfeatures exceeds the threshold (0 here):

In [41]: df.groupby(level=['categories','features']).filter(lambda x: (x.xs('subfeature1',level='subfeatures')-x.xs('subfeature2',level='subfeatures')).max().max()>0)
Out[41]: 
                                        0         1
categories features subfeatures                    
cat1       feature1 subfeature1 -0.224487 -0.227524
                    subfeature2 -0.591399 -0.799228
           feature2 subfeature1  1.190110 -1.365895
                    subfeature2  0.720956 -1.325562
cat2       feature1 subfeature1  1.856932       NaN
                    subfeature2 -1.354258 -0.740473
           feature2 subfeature1  0.234075 -1.362235
                    subfeature2  0.013875  1.309564
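
The filter above uses a signed difference and a threshold of 0, and it returns rows rather than category names. A minimal sketch of a fully vectorized variant that follows the question more literally (absolute difference, threshold 1, list of categories) is shown below. The values are copied by hand from the printed table, and the names `sub1`, `sub2`, `diffs`, `exceeds`, `per_category` and `selected` are illustrative assumptions, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Values copied from the table in the question (NaN where the table shows NaN).
values = [
    [-0.224487, -0.227524], [-0.591399, -0.799228],
    [1.190110, -1.365895], [0.720956, -1.325562],
    [1.856932, np.nan], [-1.354258, -0.740473],
    [0.234075, -1.362235], [0.013875, 1.309564],
    [np.nan, np.nan], [-1.260408, 1.559721],
    [0.419246, 0.084386], [0.969270, 1.493417],
]
index = pd.MultiIndex.from_product(
    [["cat1", "cat2", "cat3"],
     ["feature1", "feature2"],
     ["subfeature1", "subfeature2"]],
    names=["categories", "features", "subfeatures"],
)
df = pd.DataFrame(values, index=index)

threshold = 1  # assumed: the value used in the question's example

# One cross-section per subfeature; both are indexed by (categories, features).
sub1 = df.xs("subfeature1", level="subfeatures")
sub2 = df.xs("subfeature2", level="subfeatures")

# Absolute difference per (category, feature) pair and column.
diffs = (sub1 - sub2).abs()

# NaN > threshold evaluates to False, so pairs containing NaN never match.
exceeds = (diffs > threshold).any(axis=1)

# A category qualifies if any of its features has a qualifying pair.
per_category = exceeds.groupby(level="categories").any()
selected = per_category[per_category].index.tolist()
print(selected)  # ['cat2', 'cat3']
```

If the matching rows are wanted instead of the names, `df[df.index.get_level_values("categories").isin(selected)]` should select them.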

A useful debugging aid is to do something like this:

def f(x):
    print(x)
    return (x.xs(......)) # e.g. the filter from above

df.groupby(.....).filter(f)
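
On the NaN question in the EDIT: the function there calls Python's built-in max rather than the pandas max method, and the two behave differently. The sketch below only illustrates that general behaviour with made-up values (`frame` and the variable names are hypothetical). It is not a diagnosis of what the specific pandas version in the question did internally.

```python
import pandas as pd

nan = float("nan")

# Built-in max keeps the first item unless a later one compares greater.
# Every comparison with NaN is False, so the result depends on NaN's position.
nan_first = max([nan, 2.0])   # NaN is kept, 2.0 never "wins"
nan_last = max([2.0, nan])    # 2.0 is kept
print(nan_first, nan_last)    # nan 2.0

# Series.max skips NaN by default (skipna=True).
series_max = pd.Series([nan, 2.0]).max()
print(series_max)             # 2.0

# Iterating a DataFrame yields its column labels, so built-in max returns
# the largest label, not the largest value.
frame = pd.DataFrame({0: [5.0], 1: [7.0]})
label_max = max(frame)
value_max = frame.max().max()
print(label_max, value_max)   # 1 7.0
```

Using `.max()` on the pandas object (twice for a DataFrame, once per axis) avoids both pitfalls.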
