
Pandas str.count

Consider the following dataframe. I want to count the number of '$' that appear in a string. I use the str.count function in pandas ( http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.count.html ).

>>> import pandas as pd
>>> df = pd.DataFrame(['$$a', '$$b', '$c'], columns=['A'])
>>> df['A'].str.count('$')
0    1
1    1
2    1
Name: A, dtype: int64

I was expecting the result to be [2, 2, 1]. What am I doing wrong?

In plain Python, the count method of the built-in str type returns the correct result.

>>> a = "$$$$abcd"
>>> a.count('$')
4
>>> a = '$abcd$dsf$'
>>> a.count('$')
3

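$ has a special meaning in regular expressions: it is an anchor that matches at the end of the string (or at the end of each line in multiline mode). The pattern '$' therefore matches exactly once per string, which is why every row returns 1. Escape it to match a literal dollar sign: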

In [21]: df.A.str.count(r'\$')
Out[21]:
0    2
1    2
2    1
Name: A, dtype: int64
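
To see why the unescaped pattern yields 1 for every row, it can help to look at what the regex actually matches. The following is a minimal sketch using the standard re module on the sample strings from the question; the variable names are only illustrative.

```python
import re

samples = ['$$a', '$$b', '$c']

# An unescaped '$' is an anchor: it matches the empty string at the end
# of each sample, so there is exactly one (empty) match per string.
anchor_matches = [re.findall('$', s) for s in samples]

# Escaping the dollar sign makes it a literal character.
literal_matches = [re.findall(r'\$', s) for s in samples]

anchor_counts = [len(m) for m in anchor_matches]
literal_counts = [len(m) for m in literal_matches]
print(anchor_counts)   # [1, 1, 1]
print(literal_counts)  # [2, 2, 1]
```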

As the other answers have noted, the issue here is that $ denotes the end of the string. If you do not intend to use regular expressions, you may find that using str.count (that is, the method of the built-in type str) is faster than its pandas counterpart:

In [39]: df['A'].apply(lambda x: x.count('$'))
Out[39]: 
0    2
1    2
2    1
Name: A, dtype: int64

In [40]: %timeit df['A'].str.count(r'\$')
1000 loops, best of 3: 243 µs per loop

In [41]: %timeit df['A'].apply(lambda x: x.count('$'))
1000 loops, best of 3: 202 µs per loop

Try the pattern [$] so that $ is not treated as the end-of-string anchor (see this cheatsheet). Inside square brackets [], $ is treated as a literal character:

In [3]:
df = pd.DataFrame(['$$a', '$$b', '$c'], columns=['A'])
df['A'].str.count('[$]')

Out[3]:
0    2
1    2
2    1
Name: A, dtype: int64
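
When the substring to count is not known in advance, escaping it by hand or wrapping it in brackets can get error-prone. One option is to pass the substring through re.escape before handing it to Series.str.count. The sketch below assumes this approach; count_literal is a hypothetical helper name, not part of pandas.

```python
import re
import pandas as pd

def count_literal(series, substring):
    """Count non-overlapping literal occurrences of `substring` per string.

    `count_literal` is a hypothetical helper, not a pandas function.
    """
    # re.escape backslash-escapes regex metacharacters such as '$' or '.'
    return series.str.count(re.escape(substring))

df = pd.DataFrame(['$$a', '$$b', '$c'], columns=['A'])
dollar_counts = count_literal(df['A'], '$')
print(dollar_counts.tolist())  # [2, 2, 1]
```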

Taking a cue from @fuglede, the same built-in str.count can be used in a list comprehension:

pd.Series([x.count('$') for x in df.A.values.tolist()], df.index)

As pointed out by @jezrael, the above fails when the column contains null values, so:

def tc(x):
    try:
        return x.count('$')
    except AttributeError:  # NaN and None have no .count method
        return 0

pd.Series([tc(x) for x in df.A.values.tolist()], df.index)
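
For comparison, the pandas string accessor does not raise on missing values: it propagates them as NaN, which can then be filled. This is a minimal sketch that assumes a null should count as zero dollar signs; the Series s is made-up sample data.

```python
import numpy as np
import pandas as pd

s = pd.Series(['$$a', np.nan, '$c'])

# .str.count leaves missing values as NaN instead of raising
counts_with_nan = s.str.count(r'\$')

# Treat missing values as containing zero dollar signs
counts = counts_with_nan.fillna(0).astype(int)
print(counts.tolist())  # [2, 0, 1]
```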

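Timings

import numpy as np
import pandas as pd

np.random.seed([3,1415])
# 100,000 strings, each made of 0 to 99 dollar signs
df = pd.Series(np.random.randint(0, 100, 100000)) \
       .apply(lambda x: '$' * x).to_frame('A')

# turn empty strings into nulls so the null-handling path is exercised
df['A'] = df['A'].replace('', np.nan)

def tc(x):
    try:
        return x.count('$')
    except AttributeError:  # NaN and None have no .count method
        return 0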
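
To run a comparison of this kind locally, something like the following could be used. It is only a sketch: the seed, the smaller sample size, and the candidates dictionary are assumptions made for illustration, and no particular timing figures are implied. It first checks that the three approaches agree, then times each with the standard timeit module.

```python
import timeit
import numpy as np
import pandas as pd

np.random.seed(0)  # arbitrary seed, chosen here only for reproducibility
lengths = np.random.randint(0, 100, 1000)
# strings of '$' of the given lengths; empty strings become NaN
s = pd.Series(['$' * n if n else np.nan for n in lengths], dtype=object)

def tc(x):
    try:
        return x.count('$')
    except AttributeError:  # NaN has no .count method
        return 0

candidates = {
    'str.count': lambda: s.str.count(r'\$').fillna(0).astype(int),
    'apply': lambda: s.apply(tc),
    'comprehension': lambda: pd.Series([tc(x) for x in s.tolist()], s.index),
}

# Run each approach once so the results can be compared
results = {name: fn() for name, fn in candidates.items()}

for name, fn in candidates.items():
    seconds = timeit.timeit(fn, number=10)
    print(f'{name}: {seconds / 10 * 1e3:.3f} ms per call')
```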

