Extracting a substring from a Python string using a regular expression

I have a pandas column like this:

LOD-NY-EP-ADM
LOD-NY-EC-RUL
LOD-NY-EC-WFL
LOD-NY-LSM-SER
LOD-NY-PM-MOB
LOD-NY-PM-MOB
LOD-NY-RMK
LOD-NY-EC-TIM

I want the output in a new column to be:

EP
EC
EC
LSM
PM
PM
RMK
EC

I tried this:

pattern=df.column[0:10].str.extract(r"\w*-NY-(.*?)-\w*",expand=False)

It works for every row except LOD-NY-RMK, where it returns NaN. Nothing follows RMK, and the pattern looks for -\w zero or more times, so I would expect it to still match when there is nothing after RMK.

Any idea what's going wrong?

If you are not familiar with pandas syntax, an answer that applies a regular expression to a plain list of these strings is fine too.

Could you just use plain Python? Let df be your dataframe, and row be the name of your column.

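import pandas as pd
series = df.row
# Split on '-' and keep the third field, e.g. 'LOD-NY-EP-ADM' -> 'EP'
new_list = [i.split('-')[2] for i in series]
# Reuse the original index so the result lines up with df
new_series = pd.Series(new_list, index=series.index)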
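
The same split-on-hyphen idea can also be written with pandas' vectorized string methods, which avoids building the intermediate list by hand. This is only a sketch: the column name "code" and the new column name "middle" are made up for illustration, and the sample values are copied from the question.

```python
import pandas as pd

# Sample values copied from the question; the column name "code" is hypothetical.
df = pd.DataFrame({"code": [
    "LOD-NY-EP-ADM", "LOD-NY-EC-RUL", "LOD-NY-EC-WFL", "LOD-NY-LSM-SER",
    "LOD-NY-PM-MOB", "LOD-NY-PM-MOB", "LOD-NY-RMK", "LOD-NY-EC-TIM",
]})

# Split each string on "-" and take the third field (index 2).
# .str[2] gives NaN rather than raising if a value has fewer than three fields.
df["middle"] = df["code"].str.split("-").str[2]

print(df["middle"].tolist())
# ['EP', 'EC', 'EC', 'LSM', 'PM', 'PM', 'RMK', 'EC']
```

Because the result is assigned straight back to df, it keeps the dataframe's own index.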

# Capture the run of word characters right after "-NY-"; nothing has to follow it
pattern=df.column[0:10].str.extract(r"\w*-NY-(\w+)",expand=False)

See https://regex101.com/r/3uDpam/3

Your regex required matching strings to contain three - characters: the \w* after the capture group can match nothing, but the - in front of it is mandatory. I changed it so that the last -XX could occur 0 or 1 times.
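
To see the difference without pandas, here is a small sketch using the standard re module on a plain list, as the question allowed. The original pattern is the one from the question. The "optional" pattern is my own reconstruction of "last -XX could occur 0 or 1 times", not necessarily the exact intermediate pattern used in this answer, and first_group is a helper name made up for the example.

```python
import re

values = ["LOD-NY-EP-ADM", "LOD-NY-RMK"]

# Pattern from the question: a literal "-" must follow the captured group.
original = re.compile(r"\w*-NY-(.*?)-\w*")
# Assumed reconstruction: the trailing "-XX" part is wrapped in an optional,
# non-capturing group, so it may be absent.
optional = re.compile(r"\w*-NY-(\w+)(?:-\w*)?")

def first_group(pattern, text):
    """Return capture group 1 of the first match, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(1) if m else None

for v in values:
    print(v, first_group(original, v), first_group(optional, v))
# LOD-NY-EP-ADM EP EP
# LOD-NY-RMK None RMK
```

The None for LOD-NY-RMK under the original pattern corresponds to the NaN that str.extract reports in pandas.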

UPDATE: Changed so the 2nd group is non-capturing (added ?:)

UPDATE: Thanks to Casimir, removed the useless group at the end of the pattern
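
For completeness, a self-contained run of the final pattern against the sample data from the question. The Series name "code" is just a placeholder, and the [0:10] slice from the question is dropped because the sample has only eight rows.

```python
import pandas as pd

# Sample values copied from the question; the name "code" is hypothetical.
s = pd.Series([
    "LOD-NY-EP-ADM", "LOD-NY-EC-RUL", "LOD-NY-EC-WFL", "LOD-NY-LSM-SER",
    "LOD-NY-PM-MOB", "LOD-NY-PM-MOB", "LOD-NY-RMK", "LOD-NY-EC-TIM",
], name="code")

# \w+ stops at the next "-" (a hyphen is not a word character), and the
# pattern does not require anything after the captured group.
extracted = s.str.extract(r"\w*-NY-(\w+)", expand=False)

print(extracted.tolist())
# ['EP', 'EC', 'EC', 'LSM', 'PM', 'PM', 'RMK', 'EC']
```

With a single capture group and expand=False, str.extract returns a Series, so the result can be assigned directly to a new column.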
