I have text reviews in one column of a Pandas DataFrame and I want to count the N most frequent words along with their frequency counts (across the whole column, NOT within a single cell). One approach is to count the words with a Counter, iterating through each row. Is there a better alternative?
Representative data:
0 a heartening tale of small victories and endu
1 no sophomore slump for director sam mendes w
2 if you are an actor who can relate to the sea
3 it's this memory-as-identity obviation that g
4 boyd's screenplay ( co-written with guardian
from collections import Counter
Counter(" ".join(df["text"]).split()).most_common(100)
I'm pretty sure this would give you what you want (you might have to remove some non-words from the counter result before calling most_common).
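One way to handle the non-word cleanup is to tokenize with a regex before counting, so punctuation never reaches the Counter at all. A minimal sketch, using a small hypothetical frame in place of your df:

```python
import re
from collections import Counter

import pandas as pd

# Hypothetical stand-in for your reviews column
df = pd.DataFrame({"text": ["a heartening tale!", "a tale, retold"]})

# Keep only letter/apostrophe runs, so "tale!" and "tale," both count as "tale"
words = re.findall(r"[a-z']+", " ".join(df["text"]).lower())
top = Counter(words).most_common(2)
print(top)  # [('a', 2), ('tale', 2)]
```

This avoids a post-hoc filtering pass over the most_common result.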
Along with @Joran's solution, for large amounts of text/rows you could also use series.value_counts:
pd.Series(' '.join(df['text']).lower().split()).value_counts()[:100]
You would find from the benchmarks that series.value_counts seems about twice (2x) as fast as the Counter method.
For a Movie Reviews dataset of 3,000 rows, totaling 400K characters and 70K words:
In [448]: %timeit Counter(" ".join(df.text).lower().split()).most_common(100)
10 loops, best of 3: 44.2 ms per loop
In [449]: %timeit pd.Series(' '.join(df.text).lower().split()).value_counts()[:100]
10 loops, best of 3: 27.1 ms per loop
I'm going to have to disagree with @Zero.
For 91,000 strings (email addresses), I found collections.Counter(..).most_common(n)
to be faster. However, series.value_counts
may still be faster if there are over 500K words.
%%timeit
[i[0] for i in Counter(data_requester['requester'].values).most_common(5)]
# 13 ms ± 321 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
data_requester['requester'].value_counts().index[:5]
# 22.2 ms ± 597 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)