I have a Pandas dataframe with a column of string data that consists of two distinct parts separated by a forward slash. I want to extract text patterns from the right-hand side of the string, but not if a particular string pattern is present. The following trivial example illustrates the issue.
import numpy as np
import pandas as pd
import re
myDF = pd.DataFrame({'pet':['rabbit','mammal/rabbit','mammal/small fluffy rabbit','mammal/lop-eared rabbit','mammal/many rabbits','mammal/jack rabbit']})
So, the dataframe looks like:
pet
0 rabbit
1 mammal/rabbit
2 mammal/small fluffy rabbit
3 mammal/lop-eared rabbit
4 mammal/many rabbits
5 mammal/jack rabbit
I want to extract rabbit-related terms, but only if they occur to the right of the / separator and not if rabbit is preceded by jack (with or without an intervening space).
The regex I've come up with is:
rxStr = '(?P<bunny>(?<=/)(?<!jack)(?:.*rabbits?))'
...which I hoped would require any match to be preceded by / but not if preceded by jack. However, it does not work as I had hoped. I've tried lots of variations without any luck.
rxStr = '(?P<bunny>(?<=/)(?<!jack)(?:.*rabbits?))'
rx = re.compile(rxStr,flags=re.I|re.X)
rabbitDF = myDF['pet'].str.extract(rx,expand=True)
myDF = myDF.join(rabbitDF)
print(myDF)
                          pet                bunny
0                      rabbit                  NaN
1               mammal/rabbit               rabbit
2  mammal/small fluffy rabbit  small fluffy rabbit
3     mammal/lop-eared rabbit     lop-eared rabbit
4         mammal/many rabbits         many rabbits
5          mammal/jack rabbit          jack rabbit
In row 0, the regex correctly fails to find a match because there is no / character. However, in row 5 jack rabbit is matched despite jack preceding rabbit.
How can I write a regular expression that would identify rabbit terms but only if preceded by / and not if preceded by jack? Any explanation of why the regex given above fails would also be very much appreciated.
Use a negative lookahead instead of the negative lookbehind:
myDF.pet.str.extract('(?P<bunny>(?<=/)(?!jack).*rabbit)', expand=True)
bunny
0 NaN
1 rabbit
2 small fluffy rabbit
3 lop-eared rabbit
4 many rabbit
5 NaN
( # capture group
(?<=/) # lookbehind - forwardslash
(?!jack) # negative lookahead - "jack"
.* # match anything
rabbit # match "rabbit"
)
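Here, the negative lookahead asserts that the forward slash must not be immediately followed by "jack". The original pattern fails because both lookbehinds are checked at the same position, right after the /. The text before that position ends in "/", never in "jack", so (?<!jack) always succeeds and .* is then free to consume "jack".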
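The difference between the two patterns can be checked on row 5 alone with the standard re module. This is a minimal sketch, not part of the answer itself; the variable names are illustrative, and the pandas call is left out so the snippet runs without a dataframe.

```python
import re

# Pattern from the question: both lookbehinds are tested at the position just after "/".
original = re.compile(r'(?P<bunny>(?<=/)(?<!jack)(?:.*rabbits?))', flags=re.I)

# Pattern from the answer: the negative lookahead inspects what FOLLOWS the "/".
fixed = re.compile(r'(?P<bunny>(?<=/)(?!jack).*rabbit)', flags=re.I)

text = 'mammal/jack rabbit'

m_original = original.search(text)
m_fixed = fixed.search(text)

# The text before the match start is "mammal/", which does not end in "jack",
# so the negative lookbehind cannot reject this row.
original_match = m_original.group('bunny') if m_original else None

# "jack" comes right after "/", so the lookahead rejects the only candidate position.
fixed_match = m_fixed.group('bunny') if m_fixed else None

print(original_match)  # jack rabbit
print(fixed_match)     # None
```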
In [52]: myDF['pet'].str.extract(r'/(?P<bunny>(?!jack).*rabbits?.*)',expand=True)
Out[52]:
bunny
0 NaN
1 rabbit
2 small fluffy rabbit
3 lop-eared rabbit
4 many rabbits
5 NaN
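Both answers reject "jack" only when it comes immediately after the slash. If "jack" could also appear later on the right-hand side (for example "mammal/big jackrabbit", a made-up row that is not in the original data), one possible variation is to have the lookahead scan ahead for "jack" directly before "rabbit". The sketch below uses plain re with a hypothetical helper extract_bunny; it is an assumption about the intended behaviour, not something stated in the answers.

```python
import re

# Hypothetical stricter pattern: after the "/", refuse to match if "jack"
# (optionally followed by whitespace) appears directly before "rabbit" anywhere ahead.
strict = re.compile(r'/(?P<bunny>(?!.*jack\s*rabbit).*rabbits?)', flags=re.I)

pets = [
    'rabbit',
    'mammal/rabbit',
    'mammal/small fluffy rabbit',
    'mammal/lop-eared rabbit',
    'mammal/many rabbits',
    'mammal/jack rabbit',
    'mammal/big jackrabbit',  # extra illustrative row, not from the question
]

def extract_bunny(s):
    """Return the captured rabbit term, or None when the pattern does not match."""
    m = strict.search(s)
    return m.group('bunny') if m else None

results = [extract_bunny(p) for p in pets]
for pet, bunny in zip(pets, results):
    print(pet, '->', bunny)
```

The same pattern string should also work with myDF['pet'].str.extract(..., expand=True), as in the answers above, since it contains a single named group.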