简体   繁体   中英

How can I find a specific substring in a Pandas DataFrame, and then get the text after it?

So I have a Pandas dataframe that I am getting from an html webpage. The dataframe is ONLY 1 column and that column has no identifying name. I want to find a specific substring from within the dataframe, and then get the text immediately following that substring.

Note: there will NEVER be repeats in the substring search.
Eg: there will NEVER be 2 instances of School 2:

The dataframe is formatted like this:

School 1: 1 Hour Delay
School 2: 2 Hour Delay
School 3: Closed

I want to be able to search for School 3: and then return the status, whether it be closed, 1 hour delay, or 2 hour delay.

My initial thought was just if "School 3:" in df print("School 3: found") But I just get an error from that, I'm assuming because you can't just check for a string like that. If anyone knows how to find a substring and then get the text after it I would love to know.

Assuming exactly one row always matches this condition, you can use str.extract :

df.iloc[:,0].str.extract('(?<=School 3: )(.*)', expand=False).dropna().values[0]
# 'Closed'

(Note: if more than one row matches this condition, only the status of the first match is returned.)

Otherwise, if it is possible nothing matches, you will need a try-except:

try:
    status = (df.iloc[:,0]
                .str.extract('(?<=School 3: )(.*)', expand=False)
                .dropna()
                .values[0])    
except (IndexError, ValueError):
    status = np.nan

Supposed the dataframe looks like

                   status
0  School 1: 1 Hour Delay
1  School 2: 2 Hour Delay
2        School 3: Closed

you could do

txt = 'School 3'
df.status[df.status.str.contains(txt)].str[len(txt) + 2:]   # +2 for skipping ": " after the school name

Result:

2    Closed
Name: status, dtype: object

However, IMO it would be even more clear by firstly splitting the single column which contains two informations in two:

df = df.status.str.split(': ', expand=True)
df.columns = ['school', 'status']

#     school        status
#0  School 1  1 Hour Delay
#1  School 2  2 Hour Delay
#2  School 3        Closed

then you can simply retrieve the contents of column two via boolean indexing of column one:

txt = 'School 3'
df.status[df.school==txt]

#2    Closed
#Name: status, dtype: object

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM