Extract string using Regex and Python
I have a column in a df that contains the following values:
>>> import pandas as pd
>>> df = pd.DataFrame({'Sentence':['This is the results of my experiments KEY_abc_def', 'I have researched the product KEY_abc_def as requested', 'He got the idea from your message KEY_mno_pqr']})
>>> df
Sentence
0 This is the results of my experiments KEY_abc_def
1 I have researched the product KEY_abc_def as requested
2 He got the idea from your message KEY_mno_pqr
I would like to use regex to extract (or duplicate) the KEY into a new column, without the literal "KEY_" prefix. The output should be as below:
>>> df
Sentence KEY
0 This is the results of my experiments KEY_abc_def abc_def
1 I have researched the product KEY_abc_def as requested abc_def
2 He got the idea from your message KEY_mno_pqr mno_pqr
I tried this code, but it is not working. Any suggestions would be greatly appreciated.
df['KEY']= df.Sentence.str.extract("KEY_", expand=True)
If you only expect word characters (letters, digits and underscores), use
df['KEY']= df['Sentence'].str.extract(r"KEY_(\w+)", expand=False)
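For reference, here is a minimal self-contained sketch of that line applied to the sample data from the question. It only assumes that pandas is installed; the frame is rebuilt exactly as in the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({'Sentence': [
    'This is the results of my experiments KEY_abc_def',
    'I have researched the product KEY_abc_def as requested',
    'He got the idea from your message KEY_mno_pqr',
]})

# The parentheses capture only what follows "KEY_".
# expand=False returns a Series, which can be assigned to a single column.
df['KEY'] = df['Sentence'].str.extract(r"KEY_(\w+)", expand=False)

print(df['KEY'].tolist())  # ['abc_def', 'abc_def', 'mno_pqr']
```

Rows where the pattern does not match at all would get NaN in the new column.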
If KEY_ must be at the beginning of a word, you should add a \b word boundary in front of it: r"\bKEY_(\w+)".
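To see what the word boundary changes, here is a small sketch. The second string (with MONKEY_) is a made-up example, not from the question; it is only there to show a case where KEY_ appears in the middle of a longer word.

```python
import pandas as pd

# Hypothetical inputs: in the second one, "KEY_" is the tail of "MONKEY_".
s = pd.Series(['see KEY_abc_def', 'see MONKEY_abc_def'])

# Without \b, "KEY_" is found inside "MONKEY_" as well.
without_boundary = s.str.extract(r"KEY_(\w+)", expand=False)

# With \b, "KEY_" must start a word, so the second row no longer matches (NaN).
with_boundary = s.str.extract(r"\bKEY_(\w+)", expand=False)

print(without_boundary.tolist())
print(with_boundary.tolist())
```

Whether you want the stricter pattern depends on whether such embedded occurrences can appear in your data.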
Series.str.extract returns only the text captured by the capturing group(s) in the pattern, so the regex returns only the part matched by \w+, while \bKEY_ is matched but discarded from the result. This is also why the original attempt fails: "KEY_" contains no capturing group, and Series.str.extract raises a ValueError for such a pattern.
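The role of the capturing group can be illustrated with a short sketch. It uses one sentence from the question and, as an assumption for illustration only, the standard library re module to show the whole match next to the captured part.

```python
import re
import pandas as pd

s = pd.Series(['He got the idea from your message KEY_mno_pqr'])

# A pattern without a capturing group is rejected by str.extract.
error = None
try:
    s.str.extract("KEY_", expand=True)
except ValueError as exc:
    error = str(exc)
print(error)

# With plain re, group(0) is the whole match and group(1) is the captured part.
m = re.search(r"\bKEY_(\w+)", s.iloc[0])
whole_match = m.group(0)  # 'KEY_mno_pqr'
captured = m.group(1)     # 'mno_pqr'
print(whole_match, captured)
```

str.extract gives you the equivalent of group(1) for every row, which is why the KEY_ prefix does not show up in the new column.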