
How can I find a specific substring in a Pandas DataFrame, and then get the text after it?

I have a Pandas dataframe that I am getting from an HTML webpage. The dataframe has ONLY one column, and that column has no identifying name. I want to find a specific substring within the dataframe and then get the text immediately following that substring.

Note: there will NEVER be repeats in the substring search.
E.g., there will NEVER be two instances of School 2:

The dataframe is formatted like this:

School 1: 1 Hour Delay
School 2: 2 Hour Delay
School 3: Closed

I want to be able to search for School 3: and then return the status, whether it be closed, 1 hour delay, or 2 hour delay.

My initial thought was just if "School 3:" in df print("School 3: found"), but I just get an error from that, I'm assuming because you can't check for a string like that. If anyone knows how to find a substring and then get the text after it, I would love to know.

Assuming exactly one row always matches this condition, you can use str.extract:

df.iloc[:,0].str.extract('(?<=School 3: )(.*)', expand=False).dropna().values[0]
# 'Closed'

(Note: if more than one row matches this condition, only the status of the first match is returned.)
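For a self-contained check, the sketch below rebuilds the sample data from the question as a single-column frame and runs the same extraction. The default column label 0 is an assumption standing in for the unnamed column.

```python
import pandas as pd

# Sample data from the question; the unnamed column gets the default label 0.
df = pd.DataFrame(["School 1: 1 Hour Delay",
                   "School 2: 2 Hour Delay",
                   "School 3: Closed"])

# With a single capture group and expand=False, str.extract returns a Series
# holding the captured text for matching rows and NaN for all other rows.
extracted = df.iloc[:, 0].str.extract(r"(?<=School 3: )(.*)", expand=False)

# Drop the non-matching rows and take the first remaining value.
status = extracted.dropna().values[0]
print(status)  # Closed
```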

Otherwise, if it is possible that nothing matches, you will need a try-except:

import numpy as np

try:
    status = (df.iloc[:, 0]
                .str.extract('(?<=School 3: )(.*)', expand=False)
                .dropna()
                .values[0])
except (IndexError, ValueError):  # values[0] raises IndexError when no row matched
    status = np.nan
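If you would rather avoid exception handling, one possible variation is to check whether anything matched before indexing. This is only a sketch: `status_after` is a hypothetical helper name, returning None for "no match" is an assumed convention, and `re.escape` is added so the search text is treated literally.

```python
import re
import pandas as pd

def status_after(series, prefix):
    """Return the text following `prefix` in the first matching row, or None."""
    # re.escape guards against regex metacharacters in the prefix;
    # \s* skips any whitespace between the prefix and the status text.
    pattern = re.escape(prefix) + r"\s*(.*)"
    matches = series.str.extract(pattern, expand=False).dropna()
    return matches.iloc[0] if not matches.empty else None

df = pd.DataFrame(["School 1: 1 Hour Delay",
                   "School 2: 2 Hour Delay",
                   "School 3: Closed"])

print(status_after(df.iloc[:, 0], "School 3:"))  # Closed
print(status_after(df.iloc[:, 0], "School 9:"))  # None
```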

Suppose the dataframe looks like

                   status
0  School 1: 1 Hour Delay
1  School 2: 2 Hour Delay
2        School 3: Closed

you could do

txt = 'School 3'
df.status[df.status.str.contains(txt)].str[len(txt) + 2:]   # +2 for skipping ": " after the school name

Result:

2    Closed
Name: status, dtype: object
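Note that str.contains treats its argument as a regular expression by default and would also match a name such as "School 30". A slightly stricter variant, sketched here under the assumption that every row starts with the school name followed by ": ", uses str.startswith and pulls out a scalar instead of a one-element Series:

```python
import pandas as pd

df = pd.DataFrame({"status": ["School 1: 1 Hour Delay",
                              "School 2: 2 Hour Delay",
                              "School 3: Closed"]})

# Including ": " in the prefix keeps "School 3" from matching "School 30".
prefix = "School 3: "
mask = df["status"].str.startswith(prefix)

# Slice off the prefix from the matching rows only.
result = df.loc[mask, "status"].str[len(prefix):]

# Take the single value, or None if no row matched.
status = result.iloc[0] if not result.empty else None
print(status)  # Closed
```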

However, IMO it would be even clearer to first split the single column, which contains two pieces of information, into two:

df = df.status.str.split(': ', expand=True)
df.columns = ['school', 'status']

#     school        status
#0  School 1  1 Hour Delay
#1  School 2  2 Hour Delay
#2  School 3        Closed

then you can simply retrieve the contents of the second column via boolean indexing on the first column:

txt = 'School 3'
df.status[df.school==txt]

#2    Closed
#Name: status, dtype: object
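Building on the split, a possible extension is to index the result by school name so that each lookup returns a plain string. This is a sketch rather than part of the answer above: the names `raw`, `parts`, and `lookup` are illustrative, and `n=1` is added on the assumption that a status might itself contain ": ".

```python
import pandas as pd

raw = pd.DataFrame({"status": ["School 1: 1 Hour Delay",
                               "School 2: 2 Hour Delay",
                               "School 3: Closed"]})

# n=1 splits only at the first ": ", so the status text stays in one piece.
parts = raw["status"].str.split(": ", n=1, expand=True)
parts.columns = ["school", "status"]

# A Series indexed by school name behaves like a dictionary of statuses.
lookup = parts.set_index("school")["status"]

print(lookup["School 3"])      # Closed
print(lookup.get("School 9"))  # None (no such school)
```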
