Let's say my dataframe looks like this:
column_name
1 book
2 fish
3 icecream|book
4 fish
5 campfire|book
Now, if I use df['column_name'].value_counts()
it will tell me fish
is the most frequent value.
However, I want book
to be returned, since row 1, 3 and 5 contain the word 'book'.
I know .value_counts()
recognizes icecream|book
as one value, but is there a way I can determine the most frequent value by counting the amount of times each column cell CONTAINS a certain value, so that 'book' will the most frequent value?
Use split
with stack
for Series
:
a = df['column_name'].str.split('|', expand=True).stack().value_counts()
print (a)
book 3
fish 2
icecream 1
campfire 1
dtype: int64
Or Counter
with list comprehension with flattening:
from collections import Counter
a = pd.Series(Counter([y for x in df['column_name'] for y in x.split('|')]))
print (a)
book 3
fish 2
icecream 1
campfire 1
dtype: int64
pd.value_counts
You can also pass a list to the value_counts
function. Note I join
by |
then split by |
.
pd.value_counts('|'.join(df.column_name).split('|'))
book 3
fish 2
icecream 1
campfire 1
dtype: int64
get_dummies
This works because your data is structured with |
as the separator. If you had a different separator, pass it to the get_dummies
call df.column_name.str.get_dummies(sep='|').sum()
df.column_name.str.get_dummies().sum()
book 3
campfire 1
fish 2
icecream 1
dtype: int64
If you want the results sorted
df.column_name.str.get_dummies().sum().sort_values(ascending=False)
book 3
fish 2
icecream 1
campfire 1
dtype: int64
pd.factorize
and np.bincount
Note that I join
the entire column and split again.
f, u = pd.factorize('|'.join(df.column_name).split('|'))
pd.Series(np.bincount(f), u)
book 3
fish 2
icecream 1
campfire 1
dtype: int64
To sort, we can use sort_values
as we did above. Or this
f, u = pd.factorize('|'.join(df.column_name).split('|'))
counts = np.bincount(f)
a = counts.argsort()[::-1]
pd.Series(counts[a], u[a])
book 3
fish 2
campfire 1
icecream 1
dtype: int64
Using collections.Counter
+ itertools.chain
:
from collections import Counter
from itertools import chain
c = Counter(chain.from_iterable(df['column_name'].str.split('|')))
res = pd.Series(c)
print(res)
book 3
campfire 1
fish 2
icecream 1
dtype: int64
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.