str.extract vs str.count regex usage in pandas

I am trying to understand why str.extract(r"(\d+%)") returns NaN while str.count(r"(\d+%)") returns the correct answer when parsing text in a DataFrame column.

For example,

import re
import pandas as pd

df = pd.DataFrame({'Subject': ['3 hrs only! 35% off', 'Secret Savings!', 'Sale: 40% off']})
pattern = re.compile(r"(\d+%)")
df['Discount'] = df['Subject'].str.count(pattern)

...yields a Discount column with 1s in rows 1 and 3, as you would expect. However,

df['Discount'] = df['Subject'].str.extract(pattern)

...returns NaNs instead. I cannot understand why count can parse the percentages but extract cannot. This is driving me a little crazy, as it seems like it should be straightforward.

This was a bug, and it was fixed in a later pandas release.

With pandas 0.24.2, the same assignment works as expected:

>>> df.index=['a', 'b', 'c']
>>> df
               Subject Discount
a  3 hrs only! 35% off      35%
b      Secret Savings!      NaN
c        Sale: 40% off      40%
>>> df['Subject'].str.extract(pattern)
     0
a  35%
b  NaN
c  40%
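
As a minimal sketch of the difference between the two methods, assuming a recent pandas release: str.count reports how many times the pattern matches in each row, while str.extract returns the text captured by the first match. The plain-string pattern (instead of the compiled one from the question) and the expand=False argument are choices made for this sketch, not part of the original answer. expand=False makes str.extract return a Series rather than a one-column DataFrame, which is convenient when assigning to a single column.

```python
import pandas as pd

df = pd.DataFrame({'Subject': ['3 hrs only! 35% off', 'Secret Savings!', 'Sale: 40% off']})

# Plain string pattern (an assumption of this sketch; the question uses re.compile).
pattern = r"(\d+%)"

# str.count returns the number of pattern matches in each row.
counts = df['Subject'].str.count(pattern)

# str.extract returns the text of the capture group from the first match.
# expand=False yields a Series, so it maps onto a single new column.
df['Discount'] = df['Subject'].str.extract(pattern, expand=False)

print(counts.tolist())
print(df)
```

Rows with no match are left as missing values in the Discount column, so a check such as df['Discount'].notna() can be used to select the subjects that mention a discount.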
