pandas regex look ahead and behind from a 1st occurrence of character

I have Python strings like the ones below:

"1234_4534_41247612_2462184_2131_GHI.xlsx"
"1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184_2131_PQRST.GHI.xlsx"
"12JSAF34_45aAF34__sfhaksj_DHJKhd_hJD_41247612_2f462184_2131_JKLMN.OPQ.xlsx"
"1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_2131_WXY.TUV.xlsx"

I would like to do the following:

a) Extract the characters that appear before and after the 1st dot.

b) The keywords that I want are always found after the last _ symbol.

For example, in the 2nd input string I would like to get only PQRST.GHI as output: PQRST comes after the last _ and before the 1st dot, and GHI is the keyword after the 1st dot.

So, I tried the following:

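for s in strings:
    after_part = s.split('.')[1]              # text between the 1st and 2nd dot
    before_part = s.split('.')[0]             # everything before the 1st dot
    before_part = before_part.split('_')[-1]  # keep only what follows the last _
    expected_keyword = before_part + "." + after_part
    print(expected_keyword)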

Though this works, it is definitely not a nice and elegant way to write it.

Is there a better way to write this?

I expect my output to look like the following. As you can see, we get the keywords before and after the 1st dot character:

GHI
PQRST.GHI
JKLMN.OPQ
WXY.TUV

You can do this with str.extract and a single capture group:

df['text'].str.extract(r'_([^._]+\.[^.]+)', expand=False)

Output:

0    ABCDEF.GHI
1     PQRST.GHI
2     JKLMN.OPQ
3       WXY.TUV
Name: text, dtype: object
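
To see why this pattern picks up the part after the last underscore, here is a small sketch using the standard re module instead of pandas (the variable names are only for illustration). It also includes the single-dot string from the question to show what the pattern returns in that case.

```python
import re

# Same pattern as in the answer above, compiled with the standard library.
pattern = re.compile(r"_([^._]+\.[^.]+)")

strings = [
    "1234_4534_41247612_2462184_2131_ABCDEF.GHI.xlsx",
    "1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184_2131_PQRST.GHI.xlsx",
    "1234_4534_41247612_2462184_2131_GHI.xlsx",  # only one dot in the name
]

# [^._]+ cannot contain "_" or ".", so the "_" that starts the match must be
# the last underscore before the first dot. [^.]+ then runs up to the next dot.
results = [pattern.search(s).group(1) for s in strings]
print(results)
# ['ABCDEF.GHI', 'PQRST.GHI', 'GHI.xlsx']
```

Note that for a name with a single dot the file extension ends up in the result, because the pattern always captures exactly one dot.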

Try this (regex101):

import re

strings = [
    "1234_4534_41247612_2462184_2131_ABCDEF.GHI.xlsx",
    "1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184_2131_PQRST.GHI.xlsx",
    "12JSAF34_45aAF34__sfhaksj_DHJKhd_hJD_41247612_2f462184_2131_JKLMN.OPQ.xlsx",
    "1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_2131_WXY.TUV.xlsx",
]

pat = re.compile(r"[^.]+_([^.]+\.[^.]+)")

for s in strings:
    print(pat.search(s).group(1))

Prints:

ABCDEF.GHI
PQRST.GHI
JKLMN.OPQ
WXY.TUV
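
One caveat with the loop above: pat.search(s) returns None when a string does not match, so calling .group(1) directly would raise an AttributeError. A minimal sketch of a guard, assuming you would rather get None back for such names (extract_keyword is a hypothetical helper name, not part of the answer):

```python
import re
from typing import Optional

# Same pattern as in the answer above.
PAT = re.compile(r"[^.]+_([^.]+\.[^.]+)")


def extract_keyword(name: str) -> Optional[str]:
    """Return the 'before.after' keyword, or None if the name does not match."""
    match = PAT.search(name)
    return match.group(1) if match else None


print(extract_keyword(
    "1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_2131_WXY.TUV.xlsx"
))
# WXY.TUV
print(extract_keyword("no_dots_here"))
# None
```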

You can also do it with rsplit(). Specify maxsplit so that you don't split more than you need to (for efficiency):

[s.rsplit('_', maxsplit=1)[1].rsplit('.', maxsplit=1)[0] for s in strings]
# ['GHI', 'PQRST.GHI', 'JKLMN.OPQ', 'WXY.TUV']

If some strings have fewer than 2 dots and each returned string should still contain one dot, add a ternary operator that splits (or not) depending on the number of dots in the string:

[x.rsplit('.', maxsplit=1)[0] if x.count('.') > 1 else x 
 for s in strings
 for x in [s.rsplit('_', maxsplit=1)[1]]]

# ['GHI.xlsx', 'PQRST.GHI', 'JKLMN.OPQ', 'WXY.TUV']
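
The same logic may be easier to read as a small function. This is only a sketch: keyword_from_name is a made-up name, and it indexes with [-1] rather than [1] so that a name without any underscore is returned as-is instead of raising an IndexError, which is an assumption about the desired behaviour.

```python
def keyword_from_name(name: str) -> str:
    # Everything after the last underscore, e.g. "PQRST.GHI.xlsx".
    tail = name.rsplit("_", maxsplit=1)[-1]
    # Drop the final extension only when more than one dot is present,
    # so the result always keeps one dot.
    if tail.count(".") > 1:
        return tail.rsplit(".", maxsplit=1)[0]
    return tail


names = [
    "1234_4534_41247612_2462184_2131_GHI.xlsx",
    "1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184_2131_PQRST.GHI.xlsx",
]
keywords = [keyword_from_name(n) for n in names]
print(keywords)
# ['GHI.xlsx', 'PQRST.GHI']
```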
