How to conditionally slice a dataframe in pandas

Consider a pandas DataFrame constructed like:

import pandas
df = pandas.DataFrame({'a':['one','two','three']})

Then I can locate the row of the DataFrame containing 'two' like this:

df[df.a == 'two']

But so far the only way I have found to subset the DataFrame up to this row is:

df[:df[df.a == 'two'].index[0]]

but that is quite ugly, so:

Is there a more appropriate way to accomplish this subsetting?

Specifically, I am interested in how to slice the DataFrame between row indices where a given column matches some arbitrary text string (in this case 'two'). For this particular case it would be equivalent to df[:2]. In general, however, the ability to locate an index for the start and/or end of a slice based on column values seems like a reasonable thing?

One last example that may help: I would expect to be able to do something like this:

df[df.a == 'one' : df.a == 'three']

to get a slice running from the 'one' row through the 'three' row of the DataFrame, equivalent to df[0:3].

You want to identify the indices for particular start and stop values and get the matching rows plus all the rows in between. One way is to find the indexes and build a range, but you already said that you don't like that approach. Here is a general solution using boolean logic that should work for you.

First, let's make a more interesting example:

import pandas as pd
df = pd.DataFrame({'a':['one','two','three', 'four', 'five']})

Suppose start = "two" and stop = "four". That is, you want to get the following output DataFrame:

       a
1    two
2  three
3   four

We can find the index of the bounding rows via:

start, stop = "two", "four"
df["a"].isin({start, stop})
#0    False
#1     True
#2    False
#3     True
#4    False
#Name: a, dtype: bool

If the value for index 2 were True, we would be done, as we could use this output directly as a mask. So let's find a way to create the mask we need.

First, we can use cummax() and the boolean XOR operator (^) to mark every row from start up to, but not including, stop:

(df["a"]==start).cummax() ^ (df["a"]==stop).cummax()
#0    False
#1     True
#2     True
#3    False
#4    False
#Name: a, dtype: bool
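
To see why the XOR works, it may help to look at the two cummax() results separately. This is only a sketch of the intermediate steps; the names start_seen, stop_seen and between are mine, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({'a': ['one', 'two', 'three', 'four', 'five']})
start, stop = "two", "four"

# cummax() on a boolean Series stays False until the first True, then stays True.
start_seen = (df["a"] == start).cummax()
stop_seen = (df["a"] == stop).cummax()

print(start_seen.tolist())  # [False, True, True, True, True]
print(stop_seen.tolist())   # [False, False, False, True, True]

# XOR is True where exactly one flag is set:
# at or after the start row, but strictly before the stop row.
between = start_seen ^ stop_seen
print(between.tolist())     # [False, True, True, False, False]
```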

This is almost what we want, except we are missing the stop value index. So let's just bitwise OR ( | ) the stop condition:

(df["a"]==start).cummax() ^ (df["a"]==stop).cummax() | (df["a"]==stop)
#0    False
#1     True
#2     True
#3     True
#4    False
#Name: a, dtype: bool

This gets the result we are looking for. So create a mask, and index the dataframe:

mask = (df["a"]==start).cummax() ^ (df["a"]==stop).cummax() | (df["a"]==stop)
print(df[mask])
#       a
#1    two
#2  three
#3   four

We can extend these findings into a function that also supports indexing up to a row or indexing from a row to the end:

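def get_rows(df, col, start, stop):
    # Pass start=None to slice from the first row, or stop=None to slice to the last row.
    if start is None:
        # True for rows up to and including the stop row (assumes stop occurs once)
        mask = ~((df[col] == stop).cummax() ^ (df[col] == stop))
    else:
        # True from the first start row through the stop row; with stop=None
        # the stop terms are all False, so the mask runs to the end
        mask = (df[col]==start).cummax() ^ (df[col]==stop).cummax() | (df[col]==stop)
    return df[mask]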

# get rows between "two" and "four" inclusive
print(get_rows(df=df, col="a", start="two", stop="four"))
#       a
#1    two
#2  three
#3   four

# get rows from "two" until the end
print(get_rows(df=df, col="a", start="two", stop=None))
#       a
#1    two
#2  three
#3   four
#4   five

# get rows up to "two"
print(get_rows(df=df, col="a", start=None, stop="two"))
#     a
#0  one
#1  two
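
One caveat with get_rows: the mask ORs in every row equal to stop, so if stop appears again later in the column, that later row is selected too. If you want the slice to end at the first stop, a variant along these lines might work. Treat it as a sketch: get_rows_first_match and df_rep are names I made up for illustration, and it assumes "first start through first stop" is the behaviour you want.

```python
import pandas as pd

def get_rows_first_match(df, col, start=None, stop=None):
    """Rows from the first `start` through the first `stop`, both inclusive."""
    if start is None:
        # No start given: every row counts as "at or after start".
        start_seen = pd.Series(True, index=df.index)
    else:
        start_seen = (df[col] == start).cummax()
    if stop is None:
        # No stop given: no row counts as "past stop".
        stop_passed = pd.Series(False, index=df.index)
    else:
        # Shift by one row so the stop row itself is still included.
        stop_passed = (
            (df[col] == stop).cummax().shift(1, fill_value=False).astype(bool)
        )
    return df[start_seen & ~stop_passed]

# Example with repeated values in the column
df_rep = pd.DataFrame({'a': ['one', 'two', 'three', 'two', 'four', 'five', 'four']})
print(get_rows_first_match(df_rep, "a", start="two", stop="four")["a"].tolist())
# ['two', 'three', 'two', 'four']
```

The trailing 'four' at the end of df_rep is left out here, whereas the XOR/OR mask would pick it up.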

Update:

For completeness, here is the index-based solution.

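def get_rows_indexing(df, col, start, stop):
    # Assumes the default 0..n-1 integer index, so labels double as positions.
    # First row matching start, or 0 if start is None or not found
    min_ind = min(df.index[df[col]==start].tolist() or [0])
    # Last row matching stop, or len(df) if stop is None or not found
    max_ind = max(df.index[df[col]==stop].tolist() or [len(df)])
    return df[min_ind:max_ind+1]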

This function does essentially the same thing as the other version, but it may be easier to understand. Also this is more robust, as the other version relies on None not being a value in the desired column.
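
Note that get_rows_indexing takes index labels and uses them in a positional slice, which lines up only when the DataFrame has the default integer index. If your index is something else, a position-based variant could look like the sketch below. The name get_rows_by_position and the use of numpy.flatnonzero with iloc are my own choices, not part of the answer above.

```python
import numpy as np
import pandas as pd

def get_rows_by_position(df, col, start=None, stop=None):
    """Slice by position from the first `start` to the last `stop`, inclusive."""
    no_hits = np.array([], dtype=int)
    # Integer positions (not labels) of the matching rows
    start_hits = no_hits if start is None else np.flatnonzero((df[col] == start).to_numpy())
    stop_hits = no_hits if stop is None else np.flatnonzero((df[col] == stop).to_numpy())
    # Fall back to the first / last row when a bound is missing or not found
    first = start_hits[0] if start_hits.size else 0
    last = stop_hits[-1] if stop_hits.size else len(df) - 1
    return df.iloc[first:last + 1]

# Works the same with a non-default index
df_idx = pd.DataFrame({'a': ['one', 'two', 'three', 'four', 'five']},
                      index=[10, 20, 30, 40, 50])
print(get_rows_by_position(df_idx, "a", start="two", stop="four"))
#         a
# 20    two
# 30  three
# 40   four
```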

If you temporarily use column 'a' as the index, then label-based slicing with loc does exactly what you are asking, because a loc slice includes both endpoints.

df = pd.DataFrame({'a':['one','two','three', 'four', 'five']})
start = 'two'
stop = 'four'
df = df.set_index('a').loc[start:stop].reset_index()
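
A quick check of how this behaves, including open-ended slices. This is just an illustrative sketch; the variable names indexed and dup are mine. As far as I know, label slicing like this needs the boundary labels to be unique (or the index to be sorted), so the last part shows what happens when that assumption does not hold.

```python
import pandas as pd

df = pd.DataFrame({'a': ['one', 'two', 'three', 'four', 'five']})
indexed = df.set_index('a')

# Label-based slices with .loc include both endpoints.
print(indexed.loc['two':'four'].index.tolist())  # ['two', 'three', 'four']

# Open-ended slices work the same way.
print(indexed.loc[:'two'].index.tolist())        # ['one', 'two']
print(indexed.loc['two':].index.tolist())        # ['two', 'three', 'four', 'five']

# With a repeated, unsorted boundary label the slice is ambiguous
# and pandas raises a KeyError instead of guessing.
dup = pd.DataFrame({'a': ['one', 'two', 'three', 'two']}).set_index('a')
slice_failed = False
try:
    dup.loc['two':'three']
except KeyError as exc:
    slice_failed = True
    print("label slice failed:", exc)
```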
