
Filter out all rows preceding a string match

I'm attempting to get a DataFrame to discard all rows that precede the place where there is a string match in one of the columns.

In other words: The row with the string match, and all rows after it, should be kept. (Column headers should also be kept).

import pandas as pd

df = pd.read_csv(file_path)

test_string = "myUniqueMatch"    
found_match = df["Column"].str.contains(test_string).sum()

if found_match == 1:
    match_location = df[df["Column"].str.contains(test_string)].index.tolist()
    df = df.iloc[match_location]

My probably superfluous code above will find the index location of the first match (assuming there is only one possible match).

The last line of code is a placeholder. Here I would like to get all rows including and following match_location. How?

Ideally, if there are multiple matches, the first row to be kept is where the first match occurs.

If you just want to select everything from the first match onward, you can take the position of the first match and slice; the code below does not depend on the index values, in case the index is non-unique:

df.iloc[df['strings'].tolist().index(test_string):]
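Note that list.index matches whole cell values (not substrings) and raises ValueError when the value is absent, so a guarded sketch of the same idea is:

values = df['strings'].tolist()
if test_string in values:
    df = df.iloc[values.index(test_string):]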

The fix for your code would also simply be to use slicing; note that match_location is a list, so take its first element (using that label with iloc works here because read_csv produces a default RangeIndex, where labels and positions coincide):

df = df.iloc[match_location[0]:]
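If the index is not a default RangeIndex, a purely positional variant (a sketch, guarding against the no-match case) avoids mixing labels into iloc:

hits = df["Column"].str.contains(test_string)
if hits.any():
    df = df.iloc[hits.argmax():]  # argmax returns the position of the first True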

The above is quick:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.choice(list('ABCDE'), 100), columns=['strings'])
test_string = 'A'

%timeit df.iloc[df['strings'].tolist().index(test_string):]
10000 loops, best of 3: 95 µs per loop

%timeit df.iloc[np.flatnonzero(df['strings'].str.contains('A'))[0]:]
1000 loops, best of 3: 299 µs per loop

%timeit df.loc[df['strings'].str.contains('A').cumsum().astype(bool)]
1000 loops, best of 3: 516 µs per loop

I initially misread the question - the following keeps every row with a match plus the row immediately below it; I'm keeping it here in case it is useful for anyone. To select ALL rows that match and all rows immediately succeeding them, you could use .shift() and pd.Index.union along these lines:

df.loc[df[df['strings'].str.contains(test_string)].index.union(df[df['strings'].str.contains(test_string).shift().fillna(False)].index)]

Sample data:

df = pd.DataFrame(np.random.choice(list('ABCDE'), 100), columns=['strings'])
df.head()

  strings
0       B
1       A
2       B
3       E
4       D
5       C
6       E
7       D
8       D
9       D

test_string = 'A'
df.loc[df[df['strings'].str.contains(test_string)].index.union(df[df['strings'].str.contains(test_string).shift().fillna(False)].index)] 

to get:

   strings
1        A
2        B
11       A
12       A
13       D
18       A
19       C
36       A
37       C
42       A
43       E
44       A
45       C
51       A
52       B
56       A
57       A
58       A
59       C
62       A
63       D
69       A
70       E
73       A
74       E
96       A
97       A
98       B
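For readability, the same one-liner can be built up in steps (a sketch using the df and test_string from above; shift(fill_value=False) keeps the mask boolean):

hits = df['strings'].str.contains(test_string)   # rows that match
after = hits.shift(fill_value=False)             # rows immediately after a match
df.loc[df[hits].index.union(df[after].index)]    # union of the two label sets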

You could use cumsum().astype(bool) to create a boolean selection mask:

import pandas as pd
df = pd.DataFrame({'col' : ['AA', 'AB', 'BA', 'BB', 'XX', 'AA', 'AB', 'XX', 'BA', 'BB']},
                  index=[1,2]*5)

mask = df['col'].str.contains(r'XX').cumsum().astype(bool)
print(df.loc[mask])

yields

  col
1  XX
2  AA
1  AB
2  XX
1  BA
2  BB

This works because cumsum treats True as 1 and False as 0, so the running total is 0 (falsy) before the first match and positive (truthy) from the first match onward.
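You can inspect the intermediate values on the sample df above:

print(df['col'].str.contains(r'XX').cumsum().tolist())
# [0, 0, 0, 0, 1, 1, 1, 2, 2, 2] -> astype(bool): False before the first match, True from it on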


Alternatively, you could use np.flatnonzero to find the ordinal index of the first True value:

In [73]: df.iloc[np.flatnonzero(df['col'].str.contains(r'XX'))[0]:]
Out[73]: 
  col
1  XX
2  AA
1  AB
2  XX
1  BA
2  BB

This works because flatnonzero treats False as zero and True as non-zero, so it returns the positions of all True entries.
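For instance, on the same df:

import numpy as np

positions = np.flatnonzero(df['col'].str.contains(r'XX'))
print(positions)     # [4 7] -> positions of all matching rows
print(positions[0])  # 4, the position of the first match, used for slicing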

This is a bit faster for large DataFrames since it avoids the cumulative sum and the conversion of ints back to boolean values:

In [84]: df = pd.DataFrame({'col' : ['AA', 'AB', 'BA', 'BB', 'XX', 'AA', 'AB', 'XX', 'BA', 'BB']}, index=[1,2]*5)

In [85]: df = pd.concat([df]*10000)

In [86]: %timeit df.loc[df['col'].str.contains(r'XX').cumsum().astype(bool)]
10 loops, best of 3: 46 ms per loop

In [87]: %timeit df.iloc[np.flatnonzero(df['col'].str.contains(r'XX'))[0]:]
10 loops, best of 3: 43.5 ms per loop

Both of the methods above avoid reliance on the index values, in case the index is non-unique.

EDIT: Disregard - I misread and thought you were trying to discard every row immediately preceding a match. Anyway, if that is what you want, this is your code:

import pandas as pd

df = pd.read_csv(file_path)
test_string = "myUniqueMatch"

mask = df["Column"].str.contains(test_string).shift(-1).fillna(False)
newDf = df.loc[~mask]
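A minimal illustration on hypothetical data (not from the question) - the row immediately before the match is dropped:

df = pd.DataFrame({"Column": ["a", "b", "myUniqueMatch", "c"]})
mask = df["Column"].str.contains("myUniqueMatch").shift(-1, fill_value=False)
print(df.loc[~mask])
#           Column
# 0              a
# 2  myUniqueMatch
# 3              c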
