
Finding the index of the first row matching a condition in pandas

I understand I can do something like this:

df[df['data'] > 3].index.tolist()

and then take the first element of the list.

However, I need this inside a loop with many iterations over a very large DataFrame. I want to find the first match and stop right there, rather than spend time collecting every match only to discard all but the first.

Is there a way to do this with pandas? Iterating over the rows manually is extremely slow, and splitting the DataFrame into chunks and searching each one doesn't help much (possibly because it makes copies, though I'm not sure).

Edit: here's an example:

import pandas as pd

data = {'data': [10, 11, 12, 14, 15, 16, 18]}   # this is over 1M entries in practice
df = pd.DataFrame.from_dict(data)
df.index[df['data'] > 14].tolist()[0]

This returns 4, as expected.

What I want is a fast way to stop as soon as one row matches the condition.

idxmax

This still builds the full boolean Series before idxmax runs, so it does not stop early. idxmax then returns the index label of the first True value.

df['data'].gt(3).idxmax()
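One caveat with the idxmax approach: if no row satisfies the condition, the boolean Series is all False and idxmax returns the first index label instead of signalling that nothing was found. A minimal sketch of a guard follows; the helper name first_match_idxmax is made up for illustration and is not part of pandas.

```python
import pandas as pd

def first_match_idxmax(s, threshold):
    """Return the index label of the first value > threshold, or None.

    Hypothetical helper: wraps idxmax with a check for the no-match case.
    """
    mask = s.gt(threshold)
    if not mask.any():      # all False: idxmax would misleadingly return the first label
        return None
    return mask.idxmax()    # label of the first True

df = pd.DataFrame({'data': [10, 11, 12, 14, 15, 16, 18]})
print(first_match_idxmax(df['data'], 14))    # first value above 14 is 15, at label 4
print(first_match_idxmax(df['data'], 100))   # no match, so None
```

The extra any() call is one more pass over the mask, so this trades a little speed for a correct answer when there is no match.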

argmax

df.index[(df['data'].to_numpy() > 3).argmax()]
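The argmax one-liner also scans the whole array. If the first match tends to occur early, one possible variation is to run the comparison on slices of the NumPy array and return at the first slice that contains a hit. The sketch below assumes that idea; first_match_chunked and the chunk_size default are illustrative choices, not anything from pandas or NumPy. Slicing a NumPy array gives a view rather than a copy, which may avoid the overhead seen when chunking the DataFrame itself, but whether this is actually faster depends on the data and should be measured.

```python
import numpy as np
import pandas as pd

def first_match_chunked(df, column, threshold, chunk_size=65536):
    """Return the index label of the first row where df[column] > threshold.

    Hypothetical helper: compares one slice at a time and stops at the
    first slice that contains a match. Returns None if nothing matches.
    """
    values = df[column].to_numpy()
    for start in range(0, len(values), chunk_size):
        hits = values[start:start + chunk_size] > threshold  # small boolean array
        if hits.any():
            # argmax gives the position of the first True within this slice
            return df.index[start + hits.argmax()]
    return None

df = pd.DataFrame({'data': [10, 11, 12, 14, 15, 16, 18]})
print(first_match_chunked(df, 'data', 14, chunk_size=2))
```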

explicit function

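def find(s):
    # Series.items() yields (index label, value) pairs;
    # Series.iteritems() was removed in pandas 2.0
    for i, v in s.items():
        if v > 3:
            return i  # stop at the first match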

find(df['data'])
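The same early-exit loop can be written with next() and a generator expression, which makes the no-match result explicit through the default argument. This is only a sketch of an equivalent spelling; it is still a Python-level loop, so it should not be expected to be faster than the function above.

```python
import pandas as pd

df = pd.DataFrame({'data': [10, 11, 12, 14, 15, 16, 18]})
s = df['data']

# The generator is consumed lazily, so iteration stops at the first match.
first = next((i for i, v in s.items() if v > 14), None)

# With no matching value, the default (None) is returned instead of raising StopIteration.
missing = next((i for i, v in s.items() if v > 100), None)

print(first, missing)
```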

Numba

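from numba import njit

@njit
def find(a, b, c):
    # a: index labels, b: column values, c: threshold
    for x, y in zip(a, b):
        if y > c:
            return x  # stop at the first match

# pass plain NumPy arrays so Numba can compile the loop
find(df.index.to_numpy(), df['data'].to_numpy(), 3)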

