Tied up in knots with lookbehind and negative lookbehind assertions in regular expressions in Python

I have a Pandas dataframe with a column of string data that consists of two distinct parts separated by a forward slash. I want to extract text patterns from the right-hand side of the string, but not if a particular string pattern is present. The following trivial example illustrates the issue.

import numpy as np
import pandas as pd
import re

myDF = pd.DataFrame({'pet':['rabbit','mammal/rabbit','mammal/small fluffy rabbit','mammal/lop-eared rabbit','mammal/many rabbits','mammal/jack rabbit']})

So, the dataframe looks like:

                          pet
0                      rabbit
1               mammal/rabbit
2  mammal/small fluffy rabbit
3     mammal/lop-eared rabbit
4         mammal/many rabbits
5          mammal/jack rabbit

I want to be able to extract rabbit-related terms but only if they occur to the right-hand side of a / separator and not if rabbit is preceded by jack (with or without an intervening space).

The regex I've come up with is:

rxStr = '(?P<bunny>(?<=/)(?<!jack)(?:.*rabbits?))'

...which I hoped would require any match to be preceded by / but reject it if preceded by jack. However, it does not work as I had hoped. I've tried lots of variations without any luck.

rxStr = '(?P<bunny>(?<=/)(?<!jack)(?:.*rabbits?))'

rx = re.compile(rxStr,flags=re.I|re.X)

rabbitDF = myDF['pet'].str.extract(rx,expand=True)

myDF = myDF.join(rabbitDF)

print(myDF)

                          pet                bunny
0                      rabbit                  NaN
1               mammal/rabbit               rabbit
2  mammal/small fluffy rabbit  small fluffy rabbit
3     mammal/lop-eared rabbit     lop-eared rabbit
4         mammal/many rabbits         many rabbits
5          mammal/jack rabbit          jack rabbit

In row 0, the regex correctly fails to find a match because there is no / character. However, in row 5, jack rabbit is matched even though jack precedes rabbit.

How can I write a regular expression that would identify rabbit terms but only if preceded by / and not if preceded by jack? Any explanation of why the regex given above fails would also be very much appreciated.

Use a lookahead instead of a lookbehind:

myDF.pet.str.extract('(?P<bunny>(?<=/)(?!jack).*rabbit)', expand=True)

                 bunny
0                  NaN
1               rabbit
2  small fluffy rabbit
3     lop-eared rabbit
4          many rabbit
5                  NaN

(               # capture group
    (?<=/)      # lookbehind - forwardslash
    (?!jack)    # negative lookahead - "jack" 
    .*          # match anything
    rabbit      # match "rabbit"
)

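Here, the negative lookahead asserts that the text immediately after the forward slash does not start with "jack". The original pattern fails because both lookbehinds are tested at the same position, right after the slash: the text before that position ends in "mammal/", not "jack", so (?<!jack) always succeeds there, and .* is then free to consume "jack ". Note that this pattern ends in rabbit rather than rabbits?, which is why row 4 yields "many rabbit"; use rabbits? to keep the plural.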
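To see the difference between the two assertions, here is a minimal sketch using plain re on a few of the strings from the question. The helper first_match and the variable names are only illustrative, and the lookahead pattern is written with rabbits? so that plurals are kept:

```python
import re

pets = ["rabbit", "mammal/rabbit", "mammal/many rabbits", "mammal/jack rabbit"]

# Original attempt: both lookbehinds are tested at the position right after "/",
# where the preceding text is "mammal/", so (?<!jack) never rejects anything.
lookbehind = re.compile(r"(?<=/)(?<!jack)(?:.*rabbits?)", re.I)

# Lookahead variant: inspects the text that FOLLOWS the "/" instead.
lookahead = re.compile(r"(?<=/)(?!jack).*rabbits?", re.I)

def first_match(rx, text):
    """Return the first match of rx in text, or None if there is none."""
    m = rx.search(text)
    return m.group(0) if m else None

lookbehind_results = [first_match(lookbehind, p) for p in pets]
lookahead_results = [first_match(lookahead, p) for p in pets]

print(lookbehind_results)  # "jack rabbit" is still matched here
print(lookahead_results)   # "jack rabbit" is rejected here
```

Only the last string behaves differently: the lookbehind version still returns "jack rabbit", while the lookahead version returns None.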

In [52]:  myDF['pet'].str.extract(r'/(?P<bunny>(?!jack).*rabbits?.*)',expand=True)
Out[52]:
                 bunny
0                  NaN
1               rabbit
2  small fluffy rabbit
3     lop-eared rabbit
4         many rabbits
5                  NaN

RegEx explained ...
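Both patterns above only reject "jack" when it comes directly after the slash. If the requirement is read more strictly, so that "jack" anywhere between the slash and "rabbit" should block the match (for example "mammal/big jack rabbit" or "mammal/jackrabbit"), one possible approach is to apply the negative lookahead at every character. This is a sketch under that assumed reading, not something either answer proposes; strict_rx, samples and extract_bunny are hypothetical names:

```python
import re

# Assumption: "jack" anywhere between "/" and "rabbit" should block the match.
# (?:(?!jack).)* consumes one character at a time, but only while the text
# at the current position does not start with "jack".
strict_rx = re.compile(r"/(?P<bunny>(?:(?!jack).)*rabbits?)", re.I)

samples = [
    "mammal/small fluffy rabbit",
    "mammal/many rabbits",
    "mammal/jack rabbit",
    "mammal/jackrabbit",
    "mammal/big jack rabbit",
    "rabbit",
]

def extract_bunny(text):
    """Return the 'bunny' group for text, or None when nothing matches."""
    m = strict_rx.search(text)
    return m.group("bunny") if m else None

strict_results = {s: extract_bunny(s) for s in samples}
for pet, bunny in strict_results.items():
    print(f"{pet!r:32} -> {bunny!r}")
```

The same pattern string should also work with myDF['pet'].str.extract(..., expand=True), since it contains a named capture group.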
