I am doing some NLP work. My original dataframe is df_all:
Index Text
1 Hi, Hello, this is mike, I saw your son playing in the garden...
2 Besides that, sometimes my son studies math for fun...
3 I cannot believe she said that. she always says such things...
I converted my texts to a bag-of-words (BOW) dataframe, df_BOW, which now looks like this:
Index Hi This my son play garden ...
1 3 6 3 0 2 4
2 0 2 4 4 3 1
3 0 2 0 7 3 0
I want to find how many times each word appears in the corpus and plot the top words. I tried:
cnt_pro = df_all['Text'].value_counts()
plt.figure(figsize=(12,4))
sns.barplot(cnt_pro.index, cnt_pro.values, alpha=0.8)
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('Word', fontsize=12)
plt.xticks(rotation=90)
plt.show();
I expected a bar chart of the top words, but instead I get a chart that shows no useful information.
How can I fix that?
I'm not sure how you are creating df_BOW, but it's not in an ideal format for plotting. Starting from the original text instead:
df_all = pd.DataFrame(
{
"text": [
"Hi, Hello, this is mike, I saw your son playing in the garden",
"Besides that, sometimes my son studies math for fun",
"I cannot believe she said that. she always says such things",
]
}
)
As in RF Adriaansen's answer, we can use a regex to extract the words, but here using only pandas methods:
counts = df_all["text"].str.findall(r"(\w+)").explode().value_counts()
Series.str.findall: apply the regex (\w+) to capture all words. This returns a Series of lists.
Series.explode: transform each element of a list-like to a row.
Series.value_counts: return a Series containing counts of unique values.
counts is a Series with the index being the word and the value being the count:
son 2
she 2
I 2
...
says 1
garden 1
math 1
Name: text, dtype: int64
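To see what each step of that chain produces, the intermediate results can be inspected one at a time. This is only a sketch on the three example sentences; the names texts, tokens and exploded are illustrative and not part of the original answer.

```python
import pandas as pd

# The three example sentences from the question.
texts = pd.Series(
    [
        "Hi, Hello, this is mike, I saw your son playing in the garden",
        "Besides that, sometimes my son studies math for fun",
        "I cannot believe she said that. she always says such things",
    ],
    name="text",
)

# Step 1: one list of words per row.
tokens = texts.str.findall(r"\w+")

# Step 2: one word per row; the original row index is repeated.
exploded = tokens.explode()

# Step 3: count how often each distinct word occurs.
counts = exploded.value_counts()

print(tokens.iloc[1])
print(exploded.head())
print(counts.head())
```

Note that the counting is case-sensitive as written, so "Hi" and "hi" would be counted separately; calling .str.lower() on exploded before value_counts is one way to merge them.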
Then to plot:
fig, ax = plt.subplots(figsize=(6,5))
sns.barplot(x=counts.index, y=counts.values, ax=ax)
ax.set_ylabel('Number of Occurrences', fontsize=12)
ax.set_xlabel('Word', fontsize=12)
ax.xaxis.set_tick_params(rotation=90)
If you just want the top N most frequent words, you can use nlargest, like so:
top_10 = counts.nlargest(10)
and plot in the same way.
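If the bag-of-words frame from the question already exists, the same totals can probably be read from it directly, since summing each column gives the number of times that word occurs across all documents. The following is a minimal sketch, assuming df_BOW has one row per document and one column per word; the values are copied from the example table in the question, and the names word_totals and top_3 are only illustrative.

```python
import pandas as pd

# Assumed layout: one row per document, one column per word
# (values copied from the example table in the question).
df_BOW = pd.DataFrame(
    {
        "Hi": [3, 0, 0],
        "This": [6, 2, 2],
        "my": [3, 4, 0],
        "son": [0, 4, 7],
        "play": [2, 3, 3],
        "garden": [4, 1, 0],
    },
    index=[1, 2, 3],
)

# Summing down each column gives the corpus-wide count per word.
word_totals = df_BOW.sum()

# Keep only the N most frequent words (N = 3 here).
top_3 = word_totals.nlargest(3)
print(top_3)
```

top_3 can then be passed to sns.barplot in the same way as counts, using top_3.index for x and top_3.values for y.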
You could use collections.Counter to count the words:
import pandas as pd
import seaborn as sns
from collections import Counter
import re
import matplotlib.pyplot as plt
data = ['Hi, Hello, this is mike, I saw your son playing in the garden', 'Besides that, sometimes my son studies math for fun', 'I cannot believe she said that. she always says such things']
df = pd.DataFrame(data, columns=['text'])
df['text_split'] = df['text'].apply(lambda x: re.findall(r'\w+', x)) #split sentences to words with regex
words = [item.lower() for sublist in df['text_split'].tolist() for item in sublist] # flattens the list of lists and lowers the words
counted_words = Counter(words)
counted_df = pd.DataFrame(counted_words.items(), columns=['word', 'count']).sort_values('count', ascending=False).reset_index(drop=True) #create new df from counter
plt.figure(figsize=(12,4))
sns.barplot(data=counted_df[:10], x='word', y='count', alpha=0.8) #plot only the top 10 by slicing the df
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('Word', fontsize=12)
plt.xticks(rotation=90)
plt.show()
Result: a bar chart of the ten most frequent words.
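If the intermediate DataFrame is only needed to pick the top words, Counter.most_common can do the sorting and slicing by itself. This is a small sketch using only the standard library; the cut-off of 4 and the name top_words are arbitrary choices for illustration.

```python
import re
from collections import Counter

data = [
    "Hi, Hello, this is mike, I saw your son playing in the garden",
    "Besides that, sometimes my son studies math for fun",
    "I cannot believe she said that. she always says such things",
]

# Split every sentence into words and lowercase them in one pass.
words = [w.lower() for text in data for w in re.findall(r"\w+", text)]

# most_common(n) returns the n (word, count) pairs with the highest counts.
top_words = Counter(words).most_common(4)
print(top_words)
```

The resulting list of (word, count) tuples can be passed to pd.DataFrame(top_words, columns=['word', 'count']) if a DataFrame is still wanted for seaborn.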