Pandas replace regex: why this negation does not work

I have the following dataframe:

>>> df = pd.DataFrame(['0123_GRP_LE_BNS', 'ABC_GRP_BNS', 'DEF_GRP', '456A_GRP_SSA'], columns=['P'])
>>> df
                 P
0  0123_GRP_LE_BNS
1      ABC_GRP_BNS
2          DEF_GRP
3     456A_GRP_SSA

I want to remove everything that appears after GRP, unless GRP is followed by '_LE', in which case I want to remove everything after GRP_LE.

The desired output is:

0     0123_GRP_LE
1         ABC_GRP
2         DEF_GRP
3        456A_GRP

I tried the following patterns, but the output was not what I expected:

>>> df['P'].replace({r'(.*_GRP)[^_LE].*':r'\1', r'(.*GRP_LE)_.*':r'\1'}, regex=True)
0     0123_GRP_LE
1     ABC_GRP_BNS
2         DEF_GRP
3    456A_GRP_SSA
Name: P, dtype: object

Why does the negation in r'(.*_GRP)[^_LE].*' not work?

Why not make _LE optional?

df['P'].str.replace(r'(GRP(?:_LE)?).*', r'\1', regex=True)

Output:

0    0123_GRP_LE
1        ABC_GRP
2        DEF_GRP
3       456A_GRP
Name: P, dtype: object
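
As for why the original negation fails: `[^_LE]` is a negated character class, so it matches exactly one character that is not `_`, `L`, or `E`. It does not mean "not followed by the sequence `_LE`". In every sample row the character after `_GRP` is either `_` or missing, so the first pattern never matches, and only the second rule changes anything. A negative lookahead, `(?!_LE)`, expresses what was intended. Below is a minimal sketch using the standard `re` module instead of pandas so it runs on its own; the helper name `strip_suffix` is made up for illustration, and the same patterns should behave the same way when passed to `replace(..., regex=True)`.

```python
import re

values = ['0123_GRP_LE_BNS', 'ABC_GRP_BNS', 'DEF_GRP', '456A_GRP_SSA']

# [^_LE] consumes ONE character that is not '_', 'L', or 'E'.
char_class = re.compile(r'(.*_GRP)[^_LE].*')
# 'ABC_GRP_BNS': the character after '_GRP' is '_', which the class rejects.
no_match = char_class.search('ABC_GRP_BNS')
# 'DEF_GRP': nothing follows '_GRP', but the class still needs a character.
no_match_end = char_class.search('DEF_GRP')
print(no_match, no_match_end)  # None None

# (?!_LE) asserts that the text '_LE' does not come next; it consumes nothing.
lookahead = re.compile(r'(.*_GRP)(?!_LE).*')
after_le = re.compile(r'(.*GRP_LE)_.*')

def strip_suffix(s):
    """Hypothetical helper: apply both rules in order to one string."""
    s = lookahead.sub(r'\1', s)
    return after_le.sub(r'\1', s)

result = [strip_suffix(v) for v in values]
print(result)  # ['0123_GRP_LE', 'ABC_GRP', 'DEF_GRP', '456A_GRP']
```

The optional-group pattern above is still the simpler fix, since it needs only one rule.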

I find Python's string operations easier to work with and less error-prone than regex; I think this does what you're looking for:

def strip_code(code_str):
    if "GRP_LE" in code_str:
        return "".join(code_str.partition("GRP_LE")[0:2])
    elif "GRP" in code_str:
        return "".join(code_str.partition("GRP")[0:2])
    return code_str


df.P.apply(strip_code)

Output:

0    0123_GRP_LE
1        ABC_GRP
2        DEF_GRP
3       456A_GRP
Name: P, dtype: object
