I want to efficiently use values from a DataFrame's MultiIndex in some calculations. For example, starting with:
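import random
import numpy as np
import pandas as pd
np.random.seed(456)
random.seed(456)  # random.sample below uses Python's own RNG, not NumPy's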
j = [(a, b) for a in ['A','B','C'] for b in random.sample(pd.date_range('2017-01-01', periods=50, freq='W').tolist(), 5)]
i = pd.MultiIndex.from_tuples(j, names=['Name','Num'])
df = pd.DataFrame(np.random.randn(15), i, columns=['Vals'])
df['SmallestNum'] = df.reset_index(level=1).groupby('Name')['Num'].transform('min').values
Suppose I want to calculate a new column Diff = Num - SmallestNum. An efficient but, I assume, kludgy way is to copy the index level I want to reference into a bona fide column and then take the difference:
df['NumCol'] = df.index.get_level_values(1)
df['Diff'] = df['NumCol'] - df['SmallestNum']
But I feel like I'm still not understanding the proper way to work with DataFrames if I'm doing this. I thought the "correct" solution would look like either of the following, which don't create and store a full copy of the index values:
df['Diff'] = df.transform(lambda x: x.index.get_level_values(1) - x['SmallestNum'])
df['Diff'] = df.reset_index(level=1).apply(lambda x: x['Num'] - x['SmallestNum'])
... however, not only does neither of these expressions work*, but my understanding is also that DataFrame operations like .transform or .apply are bound to be significantly slower than ones that operate on explicit "vectorized" row references.
So what is the "correct and efficient" way to write the calculation for the new Diff column in this example?
* Update: This problem was compounded by the fact (possibly a bug) that the index level 1 values were not unique, which causes formulas that work when the index values are unique to fail with NotImplementedError: Index._join_level on non-unique index is not implemented. Fortunately, jezrael's answer contains workarounds that appear to be as efficient as an explicitly vectorized calculation.
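As a side note on the second attempt above: DataFrame.apply works column by column by default (axis=0), so inside the lambda x['Num'] is looked up in a column's index rather than among the column names. Passing axis=1 makes it row-wise, at the cost of a Python-level call per row. The following is only a minimal sketch on a small hand-written frame (the data is hypothetical, not the frame from the question) that checks the row-wise and vectorized forms agree:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: an integer 'Num' level with values chosen by hand.
idx = pd.MultiIndex.from_tuples(
    [('A', 28), ('A', 44), ('B', 80), ('B', 38)], names=['Name', 'Num'])
df = pd.DataFrame({'Vals': [1.0, 2.0, 3.0, 4.0]}, index=idx)
df['SmallestNum'] = (df.reset_index(level='Num')
                       .groupby('Name')['Num'].transform('min').to_numpy())

flat = df.reset_index(level='Num')

# Row-wise: with axis=1 each x is one row, so x['Num'] is a valid label lookup.
row_wise = flat.apply(lambda x: x['Num'] - x['SmallestNum'], axis=1).to_numpy()

# Vectorized: a single subtraction over whole arrays, no per-row Python call.
vectorized = (df.index.get_level_values('Num').to_numpy()
              - df['SmallestNum'].to_numpy())
```

Both arrays hold the same differences; the vectorized form is the one that scales to large frames.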
I think you simply need to subtract:
df['Diff'] = df.index.get_level_values(1) - df['SmallestNum']
print (df)
               Vals  SmallestNum  Diff
Name Num
A    28    1.180140           28     0
     44    0.984257           28    16
     90    1.835646           28    62
     43   -1.886823           28    15
     29    0.424763           28     1
B    80   -0.433105           38    42
     61   -0.166838           38    23
     46    0.754634           38     8
     38    1.966975           38     0
     93    0.200671           38    55
C    40    0.742752           12    28
     82   -1.264271           12    70
     12   -0.112787           12     0
     78    0.667358           12    66
     70    0.357900           12    58
EDIT: For a non-unique DatetimeIndex in the second level, subtracting the NumPy arrays obtained via .values works:
np.random.seed(456)
a = pd.date_range('2015-01-01', periods=6).values
j = [['A'] * 5 + ['B'] * 5 + ['C'] * 5, pd.to_datetime(np.random.choice(a, size=15))]
i = pd.MultiIndex.from_arrays(j, names=['Name','Num'])
df = pd.DataFrame(np.random.randn(15), i, columns=['Vals'])
df['SmallestNum'] = df.reset_index(level=1).groupby('Name')['Num'].transform('min').values
df['Diff'] = df.index.get_level_values(1).values - df['SmallestNum'].values
print (df)
                      Vals  SmallestNum    Diff
Name Num
A    2015-01-04  -1.842419   2015-01-02  2 days
     2015-01-06  -0.786788   2015-01-02  4 days
     2015-01-04   1.180140   2015-01-02  2 days
     2015-01-02   0.984257   2015-01-02  0 days
     2015-01-03   1.835646   2015-01-02  1 days
B    2015-01-05  -1.886823   2015-01-03  2 days
     2015-01-03   0.424763   2015-01-03  0 days
     2015-01-05  -0.433105   2015-01-03  2 days
     2015-01-06  -0.166838   2015-01-03  3 days
     2015-01-05   0.754634   2015-01-03  2 days
C    2015-01-06   1.966975   2015-01-02  4 days
     2015-01-06   0.200671   2015-01-02  4 days
     2015-01-05   0.742752   2015-01-02  3 days
     2015-01-02  -1.264271   2015-01-02  0 days
     2015-01-04  -0.112787   2015-01-02  2 days
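The reason plain arrays help here is that arithmetic between two Series first aligns them on their index labels, while NumPy arrays are combined purely by position. A small sketch of that difference, using two hypothetical Series (s1 and s2 are illustration names only):

```python
import pandas as pd

# Same labels, different order.
s1 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
s2 = pd.Series([1, 2, 3], index=['c', 'b', 'a'])

# Series arithmetic matches values by label before subtracting.
by_label = s1 - s2

# Array arithmetic ignores labels and subtracts element by element.
by_position = s1.to_numpy() - s2.to_numpy()
```

When both operands already share the same row order, as they do in the DataFrame above, the positional result is the intended one and the alignment step on the non-unique index is skipped entirely.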
Another solution:
df['Diff'] = (df.reset_index(level=1)
.groupby('Name')['Num']
.transform(lambda x: x - x.min())
.values)
print (df)
                      Vals    Diff
Name Num
A    2015-01-04  -1.842419  2 days
     2015-01-06  -0.786788  4 days
     2015-01-04   1.180140  2 days
     2015-01-02   0.984257  0 days
     2015-01-03   1.835646  1 days
B    2015-01-05  -1.886823  2 days
     2015-01-03   0.424763  0 days
     2015-01-05  -0.433105  2 days
     2015-01-06  -0.166838  3 days
     2015-01-05   0.754634  2 days
C    2015-01-06   1.966975  4 days
     2015-01-06   0.200671  4 days
     2015-01-05   0.742752  3 days
     2015-01-02  -1.264271  0 days
     2015-01-04  -0.112787  2 days
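If the lambda is a concern for speed, one possible variant is to use the built-in 'min' aggregation and do the subtraction once over the whole column. This is only a sketch under assumptions: the data below is hand-written and hypothetical, and the level values are pulled into a plain Series with a default integer index so that nothing needs to be aligned on the non-unique MultiIndex.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the datetime level contains duplicates within a group.
names = ['A', 'A', 'A', 'B', 'B', 'B']
dates = pd.to_datetime(['2015-01-04', '2015-01-02', '2015-01-04',
                        '2015-01-05', '2015-01-03', '2015-01-05'])
idx = pd.MultiIndex.from_arrays([names, dates], names=['Name', 'Num'])
df = pd.DataFrame({'Vals': np.arange(6.0)}, index=idx)

# Level values as a Series with a default RangeIndex (unique by construction).
num = pd.Series(df.index.get_level_values('Num'))
# Group keys as a plain array, in the same row order as df.
name = df.index.get_level_values('Name').to_numpy()

# Built-in 'min': the per-group minimum is broadcast back to every row.
smallest = num.groupby(name).transform('min')

# One vectorized subtraction; assign by position via a NumPy array.
df['Diff'] = (num - smallest).to_numpy()
```

The result is a timedelta column, and no copy of the index level is stored in the DataFrame itself.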