str.extract vs str.count regex usage in pandas

Question

I am trying to understand why str.extract(r"(\d+%)") returns NaN while str.count(r"(\d+%)") returns the correct answer when parsing text within a column of a DataFrame.

For example,

import re
import pandas as pd

df = pd.DataFrame({'Subject':['3 hrs only! 35% off', 'Secret Savings!', 'Sale: 40% off']})
pattern = re.compile(r"(\d+%)")
df['Discount'] = df['Subject'].str.count(pattern)

...yields a Discount column with 1 in the first and third rows, as you would expect. However,

df['Discount'] = df['Subject'].str.extract(pattern)

...returns NaNs instead. I cannot understand why count can parse the percentages but extract cannot. This is driving me a little crazy, as it seems like it should be straightforward.

Answer 1

This was a bug, and it was fixed in a later pandas version.

With pandas 0.24.2, the same extract call returns the expected values:

>>> df.index=['a', 'b', 'c']
>>> df
               Subject Discount
a  3 hrs only! 35% off      35%
b      Secret Savings!      NaN
c        Sale: 40% off      40%
>>> df['Subject'].str.extract(pattern)
     0
a  35%
b  NaN
c  40%
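
For comparison, here is a minimal sketch of the two calls side by side on a recent pandas version. It assumes a plain string pattern instead of the compiled one from the question, and passes expand=False so that extract returns a Series rather than a one-column DataFrame. The names counts and discount are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Subject': ['3 hrs only! 35% off', 'Secret Savings!', 'Sale: 40% off']})
pattern = r"(\d+%)"  # assumption: plain string pattern, one capture group

# str.count returns the number of non-overlapping matches in each row
counts = df['Subject'].str.count(pattern)

# str.extract returns the first match of the capture group in each row,
# or NaN where nothing matches; expand=False yields a Series
discount = df['Subject'].str.extract(pattern, expand=False)

df['Discount'] = discount
print(df)
```

The two methods answer different questions: count tells you how many times the pattern occurs, while extract gives you the captured text itself, so a row with no percentage gets 0 from count and NaN from extract.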
