I have 5 sentences in a NumPy array and I want to find the n most common words along with their relative count. For example, if n were 3, I would want the 3 most common words. By relative count I mean the number of times a word appears divided by the total number of words. I have an example below:
0 oh i am she cool though might off her a brownie lol
1 so trash wouldnt do colors better tweet
2 love monkey brownie as much as a tweet
3 monkey get this tweet around i think
4 saw a brownie to make me some monkey
With the help of a previous question I managed to find the most common words:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
A = np.array(["oh i am she cool though might off her a brownie lol",
"so trash wouldnt do colors better tweet",
"love monkey brownie as much as a tweet",
"monkey get this tweet around i think",
"saw a brownie to make me some monkey" ])
n = 3
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(A)
vocabulary = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
ind = np.argsort(X.toarray().sum(axis=0))[-n:]
top_n_words = [vocabulary[a] for a in ind]
print (top_n_words)
['tweet', 'monkey', 'brownie']
However, now I want to find the relative count. Is there a straightforward, Pythonic way to do this? For example:
print (top_n_words_relative_count)
[3/42, 3/42, 3/42]
Where 42 is the total number of words.
You can use collections.Counter:
>>> A = np.array(["oh i am she cool though might off her a brownie lol",
"so trash wouldnt do colors better tweet",
"love monkey brownie as much as a tweet",
"monkey get this tweet around i think",
"saw a brownie to make me some monkey" ])
>>> from collections import Counter
>>> B = ' '.join(A).split()
>>> top_n_words, top_n_words_count = zip(*Counter(B).most_common(3))
>>> top_n_words_relative_count = np.array(top_n_words_count)/len(B)
>>> top_n_words
('a', 'brownie', 'tweet')
>>> top_n_words_relative_count
array([0.07142857, 0.07142857, 0.07142857])
If you want the formatted fractions:
>>> [f"{count}/{len(B)}" for count in top_n_words_count]
['3/42', '3/42', '3/42']
Or, if you move to pandas, you can use value_counts and nlargest:
>>> import pandas as pd
>>> B = pd.Series(' '.join(A).split())
>>> B = B.value_counts(normalize=True).nlargest(3)
>>> B
monkey 0.071429
a 0.071429
tweet 0.071429
dtype: float64
>>> B.index.tolist()
['monkey', 'a', 'tweet']
>>> B.values.tolist()
[0.07142857142857142, 0.07142857142857142, 0.07142857142857142]
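The same pandas route can also produce the `count/total` strings the question asked for, by using raw counts instead of `normalize=True`. A sketch (note that four words — a, brownie, monkey, tweet — are tied at 3 occurrences, so which three nlargest keeps is arbitrary):

```python
import pandas as pd

A = ["oh i am she cool though might off her a brownie lol",
     "so trash wouldnt do colors better tweet",
     "love monkey brownie as much as a tweet",
     "monkey get this tweet around i think",
     "saw a brownie to make me some monkey"]

words = pd.Series(' '.join(A).split())      # one entry per word, 42 in total
counts = words.value_counts().nlargest(3)   # raw counts, not normalized
total = len(words)

top_n_words = counts.index.tolist()
top_n_words_relative_count = [f"{c}/{total}" for c in counts]
```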