
Pandas MultiIndex DataFrame reference index value in column calculation

I want to efficiently use values from a DataFrame's MultiIndex in some calculations. For example, starting with:

import random
import numpy as np
import pandas as pd

np.random.seed(456)
random.seed(456)  # random.sample draws from Python's own RNG, not NumPy's
j = [(a, b) for a in ['A','B','C'] for b in random.sample(pd.date_range('2017-01-01', periods=50, freq='W').tolist(), 5)]
i = pd.MultiIndex.from_tuples(j, names=['Name','Num'])
df = pd.DataFrame(np.random.randn(15), i, columns=['Vals'])
df['SmallestNum'] = df.reset_index(level=1).groupby('Name')['Num'].transform('min').values

Suppose I want to calculate a new column Diff = Num - SmallestNum. An efficient but, I assume, kludgy way is to copy the index level I want to reference into a bona fide column and then take the difference:

df['NumCol'] = df.index.get_level_values(1)
df['Diff'] = df['NumCol'] - df['SmallestNum']

But I feel like I'm still not understanding the proper way to work with DataFrames if I'm doing this. I thought the "correct" solution would look like either of the following, which don't create and store a full copy of the index values:

df['Diff'] = df.transform(lambda x: x.index.get_level_values(1) - x['SmallestNum'])
df['Diff'] = df.reset_index(level=1).apply(lambda x: x['Num'] - x['SmallestNum'])

However, neither of these expressions works*, and my understanding is that DataFrame operations like .transform or .apply are bound to be significantly slower than explicitly vectorized operations on whole columns.

So what is the "correct and efficient" way to write the calculation for the new Diff column in this example?


* Update: This problem was compounded by the fact (possibly a bug) that the index level 1 values were not unique. Formulas that work when the index values are unique then fail with "NotImplementedError: Index._join_level on non-unique index is not implemented". Fortunately, jezrael's answer contains workarounds that appear to be as efficient as an explicitly vectorized calculation.

I think you simply need to subtract:

df['Diff'] = df.index.get_level_values(1) - df['SmallestNum']
print(df)

              Vals  SmallestNum  Diff
Name Num                             
A    28   1.180140           28     0
     44   0.984257           28    16
     90   1.835646           28    62
     43  -1.886823           28    15
     29   0.424763           28     1
B    80  -0.433105           38    42
     61  -0.166838           38    23
     46   0.754634           38     8
     38   1.966975           38     0
     93   0.200671           38    55
C    40   0.742752           12    28
     82  -1.264271           12    70
     12  -0.112787           12     0
     78   0.667358           12    66
     70   0.357900           12    58
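
As a rough, self-contained check of the subtraction above, the sketch below rebuilds a frame from the Num values printed for groups A and C. The Vals column is only a placeholder (an assumption, not the original random data), and SmallestNum is computed the same way as in the question.

```python
import pandas as pd

# Integer 'Num' level copied from groups A and C of the printed output.
index = pd.MultiIndex.from_arrays(
    [["A"] * 5 + ["C"] * 5, [28, 44, 90, 43, 29, 40, 82, 12, 78, 70]],
    names=["Name", "Num"],
)
# Placeholder values; the real 'Vals' do not matter for this calculation.
df = pd.DataFrame({"Vals": range(10)}, index=index)

# Per-group minimum of 'Num', broadcast back to every row (as in the question).
df["SmallestNum"] = (
    df.reset_index(level=1).groupby("Name")["Num"].transform("min").values
)

# Subtract the column directly from the index level; no helper column is stored.
df["Diff"] = df.index.get_level_values(1) - df["SmallestNum"]
print(df)
```

Here the second level is unique within the frame, which is the case where the plain subtraction is expected to work without any workaround.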

EDIT: For a non-unique DatetimeIndex in the second level, subtracting the NumPy arrays obtained with .values works:

np.random.seed(456)
a = pd.date_range('2015-01-01', periods=6).values
j = [['A'] * 5 + ['B'] * 5 + ['C'] * 5, pd.to_datetime(np.random.choice(a, size=15))]
i = pd.MultiIndex.from_arrays(j, names=['Name','Num'])
df = pd.DataFrame(np.random.randn(15), i, columns=['Vals'])
df['SmallestNum'] = df.reset_index(level=1).groupby('Name')['Num'].transform('min').values
df['Diff'] = df.index.get_level_values(1).values - df['SmallestNum'].values
print(df)
                     Vals SmallestNum   Diff
Name Num                                    
A    2015-01-04 -1.842419  2015-01-02 2 days
     2015-01-06 -0.786788  2015-01-02 4 days
     2015-01-04  1.180140  2015-01-02 2 days
     2015-01-02  0.984257  2015-01-02 0 days
     2015-01-03  1.835646  2015-01-02 1 days
B    2015-01-05 -1.886823  2015-01-03 2 days
     2015-01-03  0.424763  2015-01-03 0 days
     2015-01-05 -0.433105  2015-01-03 2 days
     2015-01-06 -0.166838  2015-01-03 3 days
     2015-01-05  0.754634  2015-01-03 2 days
C    2015-01-06  1.966975  2015-01-02 4 days
     2015-01-06  0.200671  2015-01-02 4 days
     2015-01-05  0.742752  2015-01-02 3 days
     2015-01-02 -1.264271  2015-01-02 0 days
     2015-01-04 -0.112787  2015-01-02 2 days
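
A smaller, deterministic version of the same idea may make the behaviour easier to verify by hand. This is only a sketch: the six dates and the Vals numbers are made up for illustration, and .to_numpy() is used in place of .values, which should be equivalent here.

```python
import numpy as np
import pandas as pd

# Hand-picked dates with repeats, so the second index level is non-unique.
dates = pd.to_datetime(
    ["2015-01-04", "2015-01-02", "2015-01-04",
     "2015-01-05", "2015-01-03", "2015-01-05"]
)
index = pd.MultiIndex.from_arrays(
    [["A"] * 3 + ["B"] * 3, dates], names=["Name", "Num"]
)
df = pd.DataFrame({"Vals": np.arange(6.0)}, index=index)  # illustrative values

# Per-group earliest date, stored as a plain array to avoid index alignment.
df["SmallestNum"] = (
    df.reset_index(level=1).groupby("Name")["Num"].transform("min").to_numpy()
)

# Both operands are NumPy arrays, so the subtraction is purely positional.
df["Diff"] = df.index.get_level_values(1).to_numpy() - df["SmallestNum"].to_numpy()
print(df)
```

Because both sides are plain arrays, the result depends only on row order, so the frame must not be re-sorted between computing SmallestNum and Diff.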

Another solution:

df['Diff'] = (df.reset_index(level=1)
                .groupby('Name')['Num']
                .transform(lambda x: x - x.min())
                .values)
print(df)
                     Vals   Diff
Name Num                        
A    2015-01-04 -1.842419 2 days
     2015-01-06 -0.786788 4 days
     2015-01-04  1.180140 2 days
     2015-01-02  0.984257 0 days
     2015-01-03  1.835646 1 days
B    2015-01-05 -1.886823 2 days
     2015-01-03  0.424763 0 days
     2015-01-05 -0.433105 2 days
     2015-01-06 -0.166838 3 days
     2015-01-05  0.754634 2 days
C    2015-01-06  1.966975 4 days
     2015-01-06  0.200671 4 days
     2015-01-05  0.742752 3 days
     2015-01-02 -1.264271 0 days
     2015-01-04 -0.112787 2 days
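
Since the question asks about speed, one possible variation on this second solution is to replace the Python lambda with the built-in 'min' aggregation and group directly on the index level. This is a sketch under assumptions, not part of the original answer: the data is made up, and the helper Series num is a hypothetical name introduced here.

```python
import numpy as np
import pandas as pd

# Illustrative frame with a non-unique datetime second level.
dates = pd.to_datetime(
    ["2015-01-04", "2015-01-02", "2015-01-04",
     "2015-01-05", "2015-01-03", "2015-01-05"]
)
index = pd.MultiIndex.from_arrays(
    [["A"] * 3 + ["B"] * 3, dates], names=["Name", "Num"]
)
df = pd.DataFrame({"Vals": np.arange(6.0)}, index=index)

# Temporary Series holding the 'Num' level values, indexed like df.
num = pd.Series(df.index.get_level_values("Num").to_numpy(), index=df.index)

# Built-in 'min' instead of a lambda; grouping uses the 'Name' index level.
group_min = num.groupby(level="Name").transform("min")

# Positional subtraction on arrays, as in the answer's workaround.
df["Diff"] = num.to_numpy() - group_min.to_numpy()
print(df)
```

The temporary Series is not stored on the frame, so only the Diff column is added.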

