Count how many times a column contains a certain value in Pandas

Question

Let's say my dataframe looks like this:

   column_name
1  book
2  fish
3  icecream|book
4  fish
5  campfire|book

Now, if I use df['column_name'].value_counts() it will tell me fish is the most frequent value.

However, I want book to be returned, since rows 1, 3 and 5 contain the word 'book'.

I know .value_counts() treats icecream|book as a single value, but is there a way to determine the most frequent value by counting the number of cells that CONTAIN a certain value, so that 'book' will be the most frequent value?

Answer 1

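Use str.split with expand=True, then stack the pieces into a single Series and count them:

# Split each cell on '|', stack the pieces into one Series, then count them
a = df['column_name'].str.split('|', expand=True).stack().value_counts()
print(a)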
book        3
fish        2
icecream    1
campfire    1
dtype: int64
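The question asks for the single most frequent value rather than the whole table. A minimal sketch of that last step, assuming the sample frame is rebuilt by hand as below. It uses Series.explode (pandas 0.25 and later) in place of expand=True plus stack, which is a swapped-in alternative and not part of the original answer:

```python
import pandas as pd

# Sample data from the question, rebuilt here for illustration
df = pd.DataFrame(
    {"column_name": ["book", "fish", "icecream|book", "fish", "campfire|book"]},
    index=[1, 2, 3, 4, 5],
)

# Split each cell on '|' and give every token its own row
tokens = df["column_name"].str.split("|").explode()

# Count the tokens and pick the label with the highest count
counts = tokens.value_counts()
most_frequent = counts.idxmax()
print(most_frequent)  # book
```

If two values tie for first place, idxmax returns only one of them, so inspect counts directly when ties matter.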

Or use Counter with a flattening list comprehension:

from collections import Counter

# Flatten every '|'-separated piece into one list, count, and wrap the counts in a Series
a = pd.Series(Counter([y for x in df['column_name'] for y in x.split('|')]))
print(a)
book        3
fish        2
icecream    1
campfire    1
dtype: int64
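If only the top value is needed, the Counter can answer that without building a Series. A small sketch, assuming a plain list stands in for df['column_name']:

```python
from collections import Counter

# Sample values from the question, as a plain list standing in for the column
values = ["book", "fish", "icecream|book", "fish", "campfire|book"]

# Count every '|'-separated token across all cells
token_counts = Counter(token for cell in values for token in cell.split("|"))

# most_common(1) returns a one-element list holding a (token, count) pair
top_token, top_count = token_counts.most_common(1)[0]
print(top_token, top_count)  # book 3
```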

Answer 2

`pd.value_counts`

You can also pass a list to the top-level pd.value_counts function. Note that I join the whole column with | and then split on | again.

pd.value_counts('|'.join(df.column_name).split('|'))

book        3
fish        2
icecream    1
campfire    1
dtype: int64
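Recent pandas releases deprecate the top-level pd.value_counts function in favour of the Series method, as far as I know, so the same join-and-split idea may be safer written as below. A sketch, assuming the sample frame from the question and a column with no missing values (str.join fails on NaN):

```python
import pandas as pd

# Sample data from the question, rebuilt here for illustration
df = pd.DataFrame(
    {"column_name": ["book", "fish", "icecream|book", "fish", "campfire|book"]}
)

# Join all cells into one string, then split back into individual tokens
tokens = "|".join(df["column_name"]).split("|")

# Wrap the list in a Series and use the method form of value_counts
counts = pd.Series(tokens).value_counts()
print(counts.index[0])  # book
```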

`get_dummies`

This works because your data uses | as the separator, which is also the default separator of str.get_dummies. If you had a different separator, pass it with sep, for example df.column_name.str.get_dummies(sep=',').sum()

df.column_name.str.get_dummies().sum()

book        3
campfire    1
fish        2
icecream    1
dtype: int64

If you want the results sorted:

df.column_name.str.get_dummies().sum().sort_values(ascending=False)

book        3
fish        2
icecream    1
campfire    1
dtype: int64
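One difference from the split-based approaches is worth a quick check: str.get_dummies builds 0/1 indicator columns, so its sum counts the cells that contain a value, not the total number of occurrences. A sketch on hypothetical data where one cell repeats 'book' (this data is not from the question):

```python
import pandas as pd

# Hypothetical data: the first cell contains 'book' twice
s = pd.Series(["book|book", "fish", "icecream|book"])

# One indicator column per token: 1 if the cell contains it, else 0
contains_counts = s.str.get_dummies(sep="|").sum()

# For comparison: every occurrence counted after splitting
occurrence_counts = s.str.split("|").explode().value_counts()

print(contains_counts["book"])    # 2 cells contain 'book'
print(occurrence_counts["book"])  # 'book' occurs 3 times
```

For the "how many cells contain X" reading of the question, the get_dummies count is the one that matches.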

`pd.factorize` and `np.bincount`

Note that I join the entire column and split again.

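import numpy as np
# f holds one integer code per token, u the unique tokens in order of first appearance
f, u = pd.factorize('|'.join(df.column_name).split('|'))
pd.Series(np.bincount(f), u)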

book        3
fish        2
icecream    1
campfire    1
dtype: int64

To sort, we can use sort_values as we did above, or sort the counts ourselves with argsort:

f, u = pd.factorize('|'.join(df.column_name).split('|'))
counts = np.bincount(f)
a = counts.argsort()[::-1]
pd.Series(counts[a], u[a])

book        3
fish        2
campfire    1
icecream    1
dtype: int64
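A related NumPy-only route is np.unique with return_counts=True, which does the unique-and-count step in one call. This is a swapped-in alternative to factorize plus bincount, not the answer's method, and it assumes a plain list stands in for the column. Unlike pd.factorize, np.unique returns the unique values sorted rather than in order of first appearance:

```python
import numpy as np
import pandas as pd

# Sample values from the question, as a plain list standing in for the column
values = ["book", "fish", "icecream|book", "fish", "campfire|book"]
tokens = np.array("|".join(values).split("|"))

# uniques comes back sorted alphabetically, with matching counts
uniques, counts = np.unique(tokens, return_counts=True)

# Stable sort on the negated counts gives descending order, ties kept alphabetical
order = np.argsort(-counts, kind="stable")
result = pd.Series(counts[order], index=uniques[order])
print(result)
```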

Answer 3

Using collections.Counter with itertools.chain:

from collections import Counter
from itertools import chain

c = Counter(chain.from_iterable(df['column_name'].str.split('|')))

res = pd.Series(c)

print(res)

book        3
campfire    1
fish        2
icecream    1
dtype: int64
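One caveat with chaining the split results: str.split leaves missing cells as NaN, and chain.from_iterable cannot iterate over a float. A sketch that drops missing cells first, using hypothetical data with one NaN (not part of the question's sample):

```python
from collections import Counter
from itertools import chain

import numpy as np
import pandas as pd

# Hypothetical column with one missing entry
s = pd.Series(["book", np.nan, "icecream|book", "fish"])

# Drop missing cells before splitting so every element is a list of tokens
split_cells = s.dropna().str.split("|")

# Flatten the lists lazily and count the tokens
c = Counter(chain.from_iterable(split_cells))

# Wrap in a Series and sort so the most frequent token comes first
res = pd.Series(c).sort_values(ascending=False)
print(res.index[0])  # book
```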

3 answers

Answer 1: 6 ACCEPTED 2018-06-25 15:53:21
Answer 2: 4 2018-06-25 15:55:53
Answer 3: 2 2018-06-25 16:00:47
