
Filter out all rows preceding a string match

I'm attempting to get a DataFrame to discard all rows that precede the place where there is a string match in one of the columns.

In other words: The row with the string match, and all rows after it, should be kept. (Column headers should also be kept).

import pandas as pd

df = pd.read_csv(file_path)

test_string = "myUniqueMatch"    
found_match = df["Column"].str.contains(test_string).sum()

if found_match == 1:
    match_location = df[df["Column"].str.contains(test_string)].index.tolist()
    df = df.iloc[match_location]

My probably superfluous code above will find the index location of the first match (assuming there is only one possible match).

The last line of code is a placeholder. Here I would like to get all rows including and following match_location. How?

Ideally, if there are multiple matches, the first row to be kept is where the first match occurs.

If you just want to select everything from the first match onward, you can take the position of the first match and slice; the code below does not depend on the index values, in case the index is non-unique:

df.iloc[df['strings'].tolist().index(test_string):]
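Note that list.index matches whole cell values (not substrings) and raises ValueError when the value is absent, so a guarded sketch of the same idea is:

values = df['strings'].tolist()
if test_string in values:
    df = df.iloc[values.index(test_string):]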

The fix for your code would also simply be to use slicing; note that match_location is a list, so take its first element (using that label with iloc works here because read_csv produces a default RangeIndex, where labels and positions coincide):

df = df.iloc[match_location[0]:]
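If the index is not a default RangeIndex, a purely positional variant (a sketch, guarding against the no-match case) avoids mixing labels into iloc:

hits = df["Column"].str.contains(test_string)
if hits.any():
    df = df.iloc[hits.argmax():]  # argmax returns the position of the first True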

The above is quick:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.choice(list('ABCDE'), 100), columns=['strings'])
test_string = 'A'

%timeit df.iloc[df['strings'].tolist().index(test_string):]
10000 loops, best of 3: 95 µs per loop

%timeit df.iloc[np.flatnonzero(df['strings'].str.contains('A'))[0]:]
1000 loops, best of 3: 299 µs per loop

%timeit df.loc[df['strings'].str.contains('A').cumsum().astype(bool)]
1000 loops, best of 3: 516 µs per loop

I initially misread the question - the following keeps every row with a match plus the row immediately below it; I'm keeping it here in case it is useful for anyone. To select ALL rows that match and all rows immediately succeeding them, you could use .shift() and pd.Index.union along these lines:

df.loc[df[df['strings'].str.contains(test_string)].index.union(df[df['strings'].str.contains(test_string).shift().fillna(False)].index)]

Sample data:

df = pd.DataFrame(np.random.choice(list('ABCDE'), 100), columns=['strings'])
df.head()

  strings
0       B
1       A
2       B
3       E
4       D
5       C
6       E
7       D
8       D
9       D

test_string = 'A'
df.loc[df[df['strings'].str.contains(test_string)].index.union(df[df['strings'].str.contains(test_string).shift().fillna(False)].index)] 

to get:

   strings
1        A
2        B
11       A
12       A
13       D
18       A
19       C
36       A
37       C
42       A
43       E
44       A
45       C
51       A
52       B
56       A
57       A
58       A
59       C
62       A
63       D
69       A
70       E
73       A
74       E
96       A
97       A
98       B
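For readability, the same one-liner can be built up in steps (a sketch using the df and test_string from above; shift(fill_value=False) keeps the mask boolean):

hits = df['strings'].str.contains(test_string)   # rows that match
after = hits.shift(fill_value=False)             # rows immediately after a match
df.loc[df[hits].index.union(df[after].index)]    # union of the two label sets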

You could use cumsum().astype(bool) to create a boolean selection mask:

import pandas as pd
df = pd.DataFrame({'col' : ['AA', 'AB', 'BA', 'BB', 'XX', 'AA', 'AB', 'XX', 'BA', 'BB']},
                  index=[1,2]*5)

mask = df['col'].str.contains(r'XX').cumsum().astype(bool)
print(df.loc[mask])

yields

  col
1  XX
2  AA
1  AB
2  XX
1  BA
2  BB

This works because cumsum treats True as 1 and False as 0, so the running total is 0 (falsy) before the first match and positive (truthy) from the first match onward.
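You can inspect the intermediate values on the sample df above:

print(df['col'].str.contains(r'XX').cumsum().tolist())
# [0, 0, 0, 0, 1, 1, 1, 2, 2, 2] -> astype(bool): False before the first match, True from it on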


Alternatively, you could use np.flatnonzero to find the ordinal index of the first True value:

In [73]: df.iloc[np.flatnonzero(df['col'].str.contains(r'XX'))[0]:]
Out[73]: 
  col
1  XX
2  AA
1  AB
2  XX
1  BA
2  BB

This works because flatnonzero treats False as zero and True as non-zero, so it returns the positions of all True entries.
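For instance, on the same df:

import numpy as np

positions = np.flatnonzero(df['col'].str.contains(r'XX'))
print(positions)     # [4 7] -> positions of all matching rows
print(positions[0])  # 4, the position of the first match, used for slicing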

This is a bit faster for large DataFrames since it avoids the cumulative sum and the conversion of ints back to boolean values:

In [84]: df = pd.DataFrame({'col' : ['AA', 'AB', 'BA', 'BB', 'XX', 'AA', 'AB', 'XX', 'BA', 'BB']}, index=[1,2]*5)

In [85]: df = pd.concat([df]*10000)

In [86]: %timeit df.loc[df['col'].str.contains(r'XX').cumsum().astype(bool)]
10 loops, best of 3: 46 ms per loop

In [87]: %timeit df.iloc[np.flatnonzero(df['col'].str.contains(r'XX'))[0]:]
10 loops, best of 3: 43.5 ms per loop

Both of the methods above avoid reliance on the index values, in case the index is non-unique.

EDIT: Disregard - I misread and thought you were trying to discard every row immediately preceding a match. Anyway, if that is what you want, this is your code:

import pandas as pd

df = pd.read_csv(file_path)
test_string = "myUniqueMatch"

mask = df["Column"].str.contains(test_string).shift(-1).fillna(False)
newDf = df.loc[~mask]
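A minimal illustration on hypothetical data (not from the question) - the row immediately before the match is dropped:

df = pd.DataFrame({"Column": ["a", "b", "myUniqueMatch", "c"]})
mask = df["Column"].str.contains("myUniqueMatch").shift(-1, fill_value=False)
print(df.loc[~mask])
#           Column
# 0              a
# 2  myUniqueMatch
# 3              c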
