
Dropping rows in dataframe based on row value

I have a few Word documents which I turned into strings before reading them into dataframes. Each dataframe is only one column wide but many rows long. They all look something like this:

0| this document is a survey
1| please fill in fully
2| Send back to address on the bottom of the sheet
etc....

The start of each dataframe is full of gibberish which I don't need, so I need to delete all the rows before the row which contains the value 'Questions'. However, it doesn't sit at the same index in each dataframe, so I can't just delete the first 20 rows because that would have a different effect on each dataframe.

How could I delete all the rows before 'Questions' in each dataframe?

Assuming you only need to keep the rows from the first occurrence of 'Questions' onward, this approach should do the trick:

Dummy Data and Setup

import pandas as pd

data = {
    'x': [
          'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k'
    ]
}

df = pd.DataFrame(data)
df

Output:

    x
0   a
1   b
2   c
3   d
4   e
5   f
6   g
7   h
8   i
9   j
10  k

Solution

Here I'll keep all rows from the first entry that starts with the letter 'f' onward (that row included):

df[df.x.str.startswith('f').cumsum() > 0]

Output:

    x
5   f
6   g
7   h
8   i
9   j
10  k
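
As a rough sketch of how the same one-liner might look on data shaped like the question's, here is a small made-up frame. The column name "text" and the row contents are assumptions for illustration only, not taken from the asker's documents.

```python
import pandas as pd

# Hypothetical one-column frame shaped like the question's data;
# the column name "text" and the row contents are made up.
survey = pd.DataFrame({
    "text": [
        "this document is a survey",
        "please fill in fully",
        "Questions",
        "1. How satisfied are you?",
        "2. Any other comments?",
    ]
})

# True for every row from the first one starting with "Questions" onward
keep = survey["text"].str.startswith("Questions").cumsum() > 0
trimmed = survey[keep]
print(trimmed)
```

If the 'Questions' row itself should also go, trimmed.iloc[1:] drops it.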

Explanation

The solution relies on two main pandas features:

  1. Series.str.startswith, which returns a boolean Series that is True for every cell starting with the given string ('f' in this example, but 'Questions' works the same way).
  2. cumsum(), which treats True as 1 and False as 0, so the running total is 0 before the first match and at least 1 from the first match onward.

Comparing that running total with 0 gives a boolean mask, and indexing the original dataframe with it keeps only the rows from the first match onward.
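
To make the two steps easier to follow, the sketch below splits the one-liner into named intermediates on the same dummy data. The variable names starts_with_f, running_total and mask are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({"x": list("abcdefghijk")})

# Step 1: boolean Series, True only on the row holding 'f'
starts_with_f = df.x.str.startswith("f")

# Step 2: running total, 0 before the first match and 1 from it onward
running_total = starts_with_f.cumsum()

# Rows with a running total above zero are the ones to keep
mask = running_total > 0

# Show the intermediates side by side
print(pd.DataFrame({
    "x": df.x,
    "starts_with_f": starts_with_f,
    "cumsum": running_total,
    "keep": mask,
}))
```

If several rows match, the running total keeps climbing past 1, but the > 0 comparison still selects everything from the first match onward.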

Another alternative is to use str.contains(). Using a toy pandas Series:

import pandas as pd

# create a toy Series (called df here only to match the question's wording)
d = ["nothing", "target is here", "help", "more_words"]
df = pd.Series(data=d)

To keep every row from the first one containing a given word, say "here", onward (inclusive), you could do:

# check rows to determine whether they contain "here"
keyword_bool = df.str.contains("here", regex=False)
# index label of the first match; with the default 0..n-1 index this is also its position
# (raises IndexError if no row contains the keyword)
idx = keyword_bool[keyword_bool].index[0]

# slice from that position onward
df = df.iloc[idx:]
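
Since the question involves several dataframes, the same idea could be wrapped in a small helper. The sketch below is one possible way to do it, not part of pandas: the function name drop_before is hypothetical, it works on positions rather than index labels so a non-default index is fine, and it assumes that returning the data unchanged is acceptable when the keyword never appears.

```python
import pandas as pd

def drop_before(series, keyword):
    """Return rows of `series` from the first one containing `keyword` onward.

    Hypothetical helper for illustration. If no row contains the keyword,
    the series is returned unchanged.
    """
    # na=False treats missing values as non-matches
    hits = series.str.contains(keyword, regex=False, na=False).to_numpy()
    if not hits.any():
        return series
    first_pos = hits.argmax()  # position of the first True
    return series.iloc[first_pos:]

# Toy Series with a non-default index to show that positions are used
s = pd.Series(["nothing", "target is here", "help", "more_words"],
              index=[10, 20, 30, 40])
result = drop_before(s, "here")
unchanged = drop_before(s, "absent")
print(result)
```

For one-column dataframes like those in the question, the helper could be applied to the single column, for example drop_before(frame.iloc[:, 0], "Questions"), where frame stands for any one of the dataframes.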
