pandas regex: look ahead and look behind from the first occurrence of a character

Question

I have Python strings like the ones below:

"1234_4534_41247612_2462184_2131_GHI.xlsx"
"1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184_2131_PQRST.GHI.xlsx"
"12JSAF34_45aAF34__sfhaksj_DHJKhd_hJD_41247612_2f462184_2131_JKLMN.OPQ.xlsx"
"1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_2131_WXY.TUV.xlsx"

I would like to do the following:

a) Extract the characters that appear before and after the 1st dot.

b) The keywords that I want are always found after the last _ symbol.

For example, in the 2nd input string I would like to get only PQRST.GHI as output: the part after the last _ and before the 1st dot, together with the keyword after the 1st dot.

So, I tried the following:

for s in strings:
    # piece between the 1st and 2nd dot
    after_part = s.split('.')[1]
    # everything before the 1st dot
    before_part = s.split('.')[0]
    # keep only what follows the last underscore
    before_part = before_part.split('_')[-1]
    expected_keyword = before_part + "." + after_part
    print(expected_keyword)

Though this works, it is definitely not as nice and elegant as a regex would be.

Is there a better way to write this?

I expect my output to look like the below. As you can see, we get the keywords before and after the 1st dot character.

GHI
PQRST.GHI
JKLMN.OPQ
WXY.TUV

Answer 1

You can do this with str.extract:

df['text'].str.extract(r'_([^._]+\.[^.]+)', expand=False)

Output:

0    ABCDEF.GHI
1     PQRST.GHI
2     JKLMN.OPQ
3       WXY.TUV
Name: text, dtype: object
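
The one-liner above assumes a DataFrame df with a text column already exists. A minimal self-contained sketch follows, assuming the column holds the file names listed in Answer 2 (the DataFrame construction and the variable names pattern and keywords are illustrative, not part of the original answer):

```python
import pandas as pd

# Assumed input: the file names from Answer 2, stored in a column called "text".
df = pd.DataFrame({
    "text": [
        "1234_4534_41247612_2462184_2131_ABCDEF.GHI.xlsx",
        "1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184_2131_PQRST.GHI.xlsx",
        "12JSAF34_45aAF34__sfhaksj_DHJKhd_hJD_41247612_2f462184_2131_JKLMN.OPQ.xlsx",
        "1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_2131_WXY.TUV.xlsx",
    ]
})

# Raw string so that "\." reaches the regex engine as an escaped dot.
# The group starts right after an underscore and may not contain "_" or "."
# before the first dot, so only the last underscore before that dot can match.
pattern = r"_([^._]+\.[^.]+)"

# With a single capture group, expand=False returns a Series.
keywords = df["text"].str.extract(pattern, expand=False)
print(keywords.tolist())
```

Rows where the pattern does not match come back as NaN rather than raising an error.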

Answer 2

Try (regex101):

import re

strings = [
    "1234_4534_41247612_2462184_2131_ABCDEF.GHI.xlsx",
    "1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184_2131_PQRST.GHI.xlsx",
    "12JSAF34_45aAF34__sfhaksj_DHJKhd_hJD_41247612_2f462184_2131_JKLMN.OPQ.xlsx",
    "1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_2131_WXY.TUV.xlsx",
]

pat = re.compile(r"[^.]+_([^.]+\.[^.]+)")

for s in strings:
    print(pat.search(s).group(1))

Prints:

ABCDEF.GHI
PQRST.GHI
JKLMN.OPQ
WXY.TUV
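
The loop above calls .group(1) directly, which raises AttributeError if a string does not match. A small sketch of a safer wrapper, assuming None is an acceptable result for non-matching names (the function name extract_keyword is made up for illustration):

```python
import re

# Same pattern as above: the greedy "[^.]+" runs up to the first dot and then
# backtracks to the last underscore before it, so the group starts there.
PAT = re.compile(r"[^.]+_([^.]+\.[^.]+)")

def extract_keyword(name):
    """Return the 'before.after' keyword, or None when the pattern does not match."""
    m = PAT.search(name)
    return m.group(1) if m else None

print(extract_keyword("1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184_2131_PQRST.GHI.xlsx"))
print(extract_keyword("no_dots_here"))
```

Note that for a name with a single dot after the last underscore, such as the first string in the question, this pattern keeps the extension and returns GHI.xlsx.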

Answer 3

You can also do it with rsplit(). Specify maxsplit so that you don't split more than you need to (for efficiency):

[s.rsplit('_', maxsplit=1)[1].rsplit('.', maxsplit=1)[0] for s in strings]
# ['GHI', 'PQRST.GHI', 'JKLMN.OPQ', 'WXY.TUV']

If some strings have fewer than 2 dots and each returned string should have one dot in it, add a conditional expression that splits (or not) depending on the number of dots in the string:

[x.rsplit('.', maxsplit=1)[0] if x.count('.') > 1 else x 
 for s in strings
 for x in [s.rsplit('_', maxsplit=1)[1]]]

# ['GHI.xlsx', 'PQRST.GHI', 'JKLMN.OPQ', 'WXY.TUV']
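
Both variants can be folded into one helper. This is only a sketch: the function name last_keyword and the require_dot flag are hypothetical, and it uses [-1] instead of [1] so that a name without any underscore does not raise IndexError.

```python
def last_keyword(name, require_dot=False):
    """Keyword after the last underscore, with the trailing extension removed."""
    # Everything after the final underscore, e.g. "PQRST.GHI.xlsx".
    tail = name.rsplit("_", maxsplit=1)[-1]
    # With require_dot=True, a tail that has at most one dot is returned
    # unchanged so that the result still contains a dot when one exists.
    if require_dot and tail.count(".") <= 1:
        return tail
    # Otherwise drop the last dot-separated piece (the file extension).
    return tail.rsplit(".", maxsplit=1)[0]

strings = [
    "1234_4534_41247612_2462184_2131_GHI.xlsx",
    "1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184_2131_PQRST.GHI.xlsx",
]
print([last_keyword(s) for s in strings])
print([last_keyword(s, require_dot=True) for s in strings])
```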
