There seems to be a lot of answers on how to get last index value from pandas dataframe but what I am trying to get index position number for the last row of every index at level 0 in a multi-index dataframe. I found a way using a loop but the data frame is millions of line and this loop is slow. I assume there is a more pythonic way of doing this.
Here is a mini example of df3. I want to get a list (or maybe an array) of the numbers in the index for the df >> the last row before it changes to a new stock. The index column is the results I want. this is the index position from the df
Stock Date Index
AAPL 12/31/2004
1/3/2005
1/4/2005
1/5/2005
1/6/2005
1/7/2005
1/10/2005 3475
AMZN 12/31/2004
1/3/2005
1/4/2005
1/5/2005
1/6/2005
1/7/2005
1/10/2005 6951
BAC 12/31/2004
1/3/2005
1/4/2005
1/5/2005
1/6/2005
1/7/2005
1/10/2005 10427
This is the code I am using, where df3 in the dataframe
test_index_list = []
for start_index in range(len(df3)-1):
end_index = start_index + 1
if df3.index[start_index][0] != df3.index[end_index][0]:
test_index_list.append(start_index)
I change divakar answer a bit with get_level_values
for indices of first level of MultiIndex
:
df = pd.DataFrame({'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbc')}).set_index(['F','A','B'])
print (df)
C D E
F A B
a a 4 7 1 5
b 5 8 3 3
c 4 9 5 6
b d 5 4 7 9
e 5 2 1 2
c f 4 3 0 4
def start_stop_arr(initial_list):
a = np.asarray(initial_list)
mask = np.concatenate(([True], a[1:] != a[:-1], [True]))
idx = np.flatnonzero(mask)
stop = idx[1:]-1
return stop
print (df.index.get_level_values(0))
Index(['a', 'a', 'a', 'b', 'b', 'c'], dtype='object', name='F')
print (start_stop_arr(df.index.get_level_values(0)))
[2 4 5]
dict.values
Using dict
to track values leaves the last found value as the one that matters.
list(dict(map(reversed, enumerate(df.index.get_level_values(0)))).values())
[2, 4, 5]
Create function that takes a factorization and number of unique values
def last(bins, k):
a = np.zeros(k, np.int64)
for i, b in enumerate(bins):
a[b] = i
return a
You can then get the factorization with
f, u = pd.factorize(df.index.get_level_values(0))
last(f, len(u))
array([2, 4, 5])
However, the way MultiIndex
is usually constructed, the labels
objects are already factorizations and the levels
objects are unique values.
last(df.index.labels[0], df.index.levels[0].size)
array([2, 4, 5])
What's more is that we can use Numba to use just in time compiling to super-charge this.
from numba import njit
@njit
def nlast(bins, k):
a = np.zeros(k, np.int64)
for i, b in enumerate(bins):
a[b] = i
return a
nlast(df.index.labels[0], df.index.levels[0].size)
array([2, 4, 5])
%%timeit
f, u = pd.factorize(df.index.get_level_values(0))
last(f, len(u))
641 µs ± 9.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
f, u = pd.factorize(df.index.get_level_values(0))
nlast(f, len(u))
264 µs ± 11.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
nlast(df.index.labels[0], len(df.index.levels[0]))
4.06 µs ± 43.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
last(df.index.labels[0], len(df.index.levels[0]))
654 µs ± 14.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
list(dict(map(reversed, enumerate(df.index.get_level_values(0)))).values())
709 µs ± 4.94 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
jezrael's solution. Also very fast.
%timeit start_stop_arr(df.index.get_level_values(0))
113 µs ± 83.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
np.unique
I did not time this because I don't like it. See below:
Using np.unique
and the return_index
argument. This returns the first place each unique value is found. After this, I'd do some shifting to get at the last position of the prior unique value.
Note : this works if the level values are in contiguous groups. If they aren't, we have to do sorting and unsorting that isn't worth it. Unless it really is then I'll show how to do it.
i = np.unique(df.index.get_level_values(0), return_index=True)[1]
np.append(i[1:], len(df)) - 1
array([2, 4, 5])
from @jezrael
df = pd.DataFrame({'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbc')}).set_index(['F','A','B'])
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.