
Get index number from multi-index dataframe in python

There seem to be a lot of answers on how to get the last index value from a pandas dataframe, but what I am trying to get is the integer index position of the last row of every group at level 0 of a multi-index dataframe. I found a way using a loop, but the dataframe is millions of lines and the loop is slow. I assume there is a more pythonic way of doing this.

Here is a mini example of df3. I want a list (or maybe an array) of the integer positions of the last row of each stock, i.e. the last row before the index changes to a new stock. The Index column below shows the result I want: the integer position within df3.

Stock   Date      Index 
AAPL    12/31/2004  
        1/3/2005    
        1/4/2005    
        1/5/2005    
        1/6/2005    
        1/7/2005    
        1/10/2005   3475
AMZN    12/31/2004  
        1/3/2005    
        1/4/2005    
        1/5/2005    
        1/6/2005    
        1/7/2005    
        1/10/2005   6951
BAC     12/31/2004  
        1/3/2005    
        1/4/2005    
        1/5/2005    
        1/6/2005    
        1/7/2005    
        1/10/2005   10427

This is the code I am using, where df3 is the dataframe:

test_index_list = []
for start_index in range(len(df3)-1):
    end_index = start_index + 1
    if df3.index[start_index][0] != df3.index[end_index][0]:
       test_index_list.append(start_index)
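For reproducibility, the loop above can be run on a toy frame shaped like df3 (the tickers, dates, and Close values here are made up for illustration). Note that as written the loop only appends a position when the *next* row belongs to a different stock, so the last row of the final group is never captured:

```python
import pandas as pd

# Hypothetical miniature of df3: two rows per stock, indexed by (Stock, Date)
df3 = pd.DataFrame(
    {"Close": [1.0, 1.1, 2.0, 2.1, 3.0, 3.1]},
    index=pd.MultiIndex.from_product(
        [["AAPL", "AMZN", "BAC"], ["12/31/2004", "1/3/2005"]],
        names=["Stock", "Date"],
    ),
)

# The loop from the question: record a position whenever the next row's
# level-0 value differs from the current one.
test_index_list = []
for start_index in range(len(df3) - 1):
    if df3.index[start_index][0] != df3.index[start_index + 1][0]:
        test_index_list.append(start_index)

print(test_index_list)  # [1, 3] -- the final group's last row (5) is missing
```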

I changed Divakar's answer a bit, using get_level_values to work on the first level of the MultiIndex:

df = pd.DataFrame({'A':list('abcdef'),
                   'B':[4,5,4,5,5,4],
                   'C':[7,8,9,4,2,3],
                   'D':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'F':list('aaabbc')}).set_index(['F','A','B'])

print (df)
       C  D  E
F A B         
a a 4  7  1  5
  b 5  8  3  3
  c 4  9  5  6
b d 5  4  7  9
  e 5  2  1  2
c f 4  3  0  4

import numpy as np

def start_stop_arr(initial_list):
    a = np.asarray(initial_list)
    # True at every position where the value changes, with sentinels at both ends
    mask = np.concatenate(([True], a[1:] != a[:-1], [True]))
    idx = np.flatnonzero(mask)
    stop = idx[1:] - 1  # last position of each run
    return stop

print (df.index.get_level_values(0))
Index(['a', 'a', 'a', 'b', 'b', 'c'], dtype='object', name='F')

print (start_stop_arr(df.index.get_level_values(0)))
[2 4 5]

dict.values

Using a dict to track positions means the last position found for each key is the one that is kept.

list(dict(map(reversed, enumerate(df.index.get_level_values(0)))).values())

[2, 4, 5]
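The trick works because dict construction overwrites repeated keys, keeping the last position seen, while preserving first-seen key order (Python 3.7+). A minimal sketch on the level-0 values from the example:

```python
vals = ["a", "a", "a", "b", "b", "c"]  # level-0 values from the example

# enumerate yields (position, value); reversed flips each pair to
# (value, position), so the dict maps value -> last position seen.
pairs = map(reversed, enumerate(vals))
last_positions = list(dict(pairs).values())

print(last_positions)  # [2, 4, 5]
```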

With Loop

Create a function that takes a factorization and the number of unique values:

def last(bins, k):
    a = np.zeros(k, np.int64)
    for i, b in enumerate(bins):
        a[b] = i
    return a

You can then get the factorization with

f, u = pd.factorize(df.index.get_level_values(0))
last(f, len(u))

array([2, 4, 5])

However, the way a MultiIndex is usually constructed, the labels objects (renamed codes in pandas 0.24+) are already factorizations and the levels objects are the unique values.

last(df.index.labels[0], df.index.levels[0].size)

array([2, 4, 5])
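On current pandas, where MultiIndex.labels has been removed in favor of MultiIndex.codes, the same call looks like this (a sketch using a trimmed-down version of jezrael's frame):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': list('abcdef'),
                   'B': [4, 5, 4, 5, 5, 4],
                   'F': list('aaabbc')}).set_index(['F', 'A', 'B'])

def last(bins, k):
    # record the position of the last occurrence of each code
    a = np.zeros(k, np.int64)
    for i, b in enumerate(bins):
        a[b] = i
    return a

# df.index.codes replaces the removed df.index.labels
out = last(df.index.codes[0], df.index.levels[0].size)
print(out)  # [2 4 5]
```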

What's more, we can use Numba's just-in-time compilation to super-charge this.

from numba import njit

@njit
def nlast(bins, k):
    a = np.zeros(k, np.int64)
    for i, b in enumerate(bins):
        a[b] = i
    return a

nlast(df.index.labels[0], df.index.levels[0].size)

array([2, 4, 5])

Timing

%%timeit
f, u = pd.factorize(df.index.get_level_values(0))
last(f, len(u))

641 µs ± 9.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
f, u = pd.factorize(df.index.get_level_values(0))
nlast(f, len(u))

264 µs ± 11.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
nlast(df.index.labels[0], len(df.index.levels[0]))

4.06 µs ± 43.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%%timeit
last(df.index.labels[0], len(df.index.levels[0]))

654 µs ± 14.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
list(dict(map(reversed, enumerate(df.index.get_level_values(0)))).values())

709 µs ± 4.94 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

jezrael's solution. Also very fast.

%timeit start_stop_arr(df.index.get_level_values(0))

113 µs ± 83.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

np.unique

I did not time this because I don't like it; see the note below.

Using np.unique with the return_index argument. This returns the first position at which each unique value is found. After that, some shifting gets the last position of the prior unique value.

Note: this works only if the level values are in contiguous groups. If they aren't, we would have to sort and unsort, which isn't worth it. Unless it really is, in which case I'll show how to do it.

i = np.unique(df.index.get_level_values(0), return_index=True)[1]
np.append(i[1:], len(df)) - 1

array([2, 4, 5])
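As a side note, a pandas-native sketch of the same idea (my addition, not from the original answer) uses Index.duplicated(keep='last'), which flags every row except the last occurrence of each value; like np.unique, it only lines up with the group boundaries when equal values are contiguous:

```python
import numpy as np
import pandas as pd

level = pd.Index(['a', 'a', 'a', 'b', 'b', 'c'], name='F')

# duplicated(keep='last') is False exactly at each value's last occurrence,
# so negating it and taking flatnonzero yields those positions.
out = np.flatnonzero(~level.duplicated(keep='last'))
print(out)  # [2 4 5]
```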

Setup

from @jezrael

df = pd.DataFrame({'A':list('abcdef'),
                   'B':[4,5,4,5,5,4],
                   'C':[7,8,9,4,2,3],
                   'D':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'F':list('aaabbc')}).set_index(['F','A','B'])
