
How to find word counts in NLP text and plot them?

I am doing some NLP work

my original dataframe is df_all

Index    Text
1        Hi, Hello, this is mike, I saw your son playing in the garden...
2        Besides that, sometimes my son studies math for fun...
3        I cannot believe she said that. she always says such things...

I converted my texts to a bag-of-words (BOW) dataframe,

so my dataframe df_BOW now looks like this:

Index    Hi   This   my   son   play   garden ...
1        3    6      3    0     2       4
2        0    2      4    4     3       1
3        0    2      0    7     3       0

I want to find how many times each word appeared in the corpus

cnt_pro = df_all['Text'].value_counts()
plt.figure(figsize=(12,4))
sns.barplot(cnt_pro.index, cnt_pro.values, alpha=0.8)
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('Word', fontsize=12)
plt.xticks(rotation=90)
plt.show();

to get top words like this

[image]

but I get this chart that shows no info

[image]

How can I fix that?

I'm not sure how you are creating df_BOW but it's not in an ideal format for plotting.

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df_all = pd.DataFrame(
    {
        "text": [
            "Hi, Hello, this is mike, I saw your son playing in the garden",
            "Besides that, sometimes my son studies math for fun",
            "I cannot believe she said that. she always says such things",
        ]
    }
)

As in RF Adriaansen's answer, we can use a regex to extract the words, but here we will use only pandas methods:

counts = df_all["text"].str.findall(r"(\w+)").explode().value_counts()

counts is a series with the index being the word and the value being the count:

son          2
she          2
I            2
...
says         1
garden       1
math         1
Name: text, dtype: int64
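Note that this count is case-sensitive, so a word such as "Son" at the start of a sentence would be tallied separately from "son". A minimal sketch of a case-insensitive variant, assuming the same three example sentences (the names tokens and counts_ci are only illustrative):

```python
import pandas as pd

df_all = pd.DataFrame(
    {
        "text": [
            "Hi, Hello, this is mike, I saw your son playing in the garden",
            "Besides that, sometimes my son studies math for fun",
            "I cannot believe she said that. she always says such things",
        ]
    }
)

# Lowercase first, then collect every run of word characters as a list per row.
tokens = df_all["text"].str.lower().str.findall(r"\w+")

# explode() gives each list element its own row; value_counts() tallies them.
counts_ci = tokens.explode().value_counts()

print(counts_ci.head())
```

Whether to lowercase depends on the task; for a quick frequency chart it usually avoids duplicate bars for the same word.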

Then to plot:

fig, ax = plt.subplots(figsize=(6,5))
sns.barplot(x=counts.index, y=counts.values, ax=ax)
ax.set_ylabel('Number of Occurrences', fontsize=12)
ax.set_xlabel('Word', fontsize=12)
ax.xaxis.set_tick_params(rotation=90)

[image]

If you just want the top N most frequent words, you can use nlargest like so:

top_10 = counts.nlargest(10)

and plot in the same way.
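If a bag-of-words frame like df_BOW from the question already exists, the same kind of totals can probably be read off by summing each column. A rough sketch, assuming only the six columns visible in the question's excerpt (the elided columns are left out, and bow_counts and top_3 are illustrative names):

```python
import pandas as pd

# Only the columns shown in the question's df_BOW excerpt are reproduced here.
df_BOW = pd.DataFrame(
    {
        "Hi": [3, 0, 0],
        "This": [6, 2, 2],
        "my": [3, 4, 0],
        "son": [0, 4, 7],
        "play": [2, 3, 3],
        "garden": [4, 1, 0],
    },
    index=[1, 2, 3],
)

# Summing down each column gives the corpus-wide total for that word.
bow_counts = df_BOW.sum(axis=0).sort_values(ascending=False)

# nlargest works on this Series just as it does on `counts`.
top_3 = bow_counts.nlargest(3)
print(top_3)
```

The resulting Series has words as its index and totals as its values, so it can be passed to the same barplot call as counts.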

You could use collections.Counter to count the words:

import pandas as pd
import seaborn as sns
from collections import Counter
import re
import matplotlib.pyplot as plt

data = ['Hi, Hello, this is mike, I saw your son playing in the garden', 'Besides that, sometimes my son studies math for fun', 'I cannot believe she said that. she always says such things']
df = pd.DataFrame(data, columns=['text'])

df['text_split'] = df['text'].apply(lambda x: re.findall(r'\w+', x)) #split sentences to words with regex
words = [item.lower() for sublist in df['text_split'].tolist() for item in sublist] # flattens the list of lists and lowers the words

counted_words = Counter(words)
counted_df = pd.DataFrame(counted_words.items(), columns=['word', 'count']).sort_values('count', ascending=False).reset_index(drop=True) #create new df from counter

plt.figure(figsize=(12,4))
sns.barplot(data=counted_df[:10], x='word', y='count', alpha=0.8) #plot only the top 10 by slicing the df
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('Word', fontsize=12)
plt.xticks(rotation=90)
plt.show()

Result:

[image]
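If only the top few words are needed, Counter.most_common can stand in for building and sorting a dataframe. A small sketch under the same example data (top_words, labels and values are illustrative names):

```python
from collections import Counter
import re

data = [
    "Hi, Hello, this is mike, I saw your son playing in the garden",
    "Besides that, sometimes my son studies math for fun",
    "I cannot believe she said that. she always says such things",
]

# Split each sentence into words with a regex and lowercase them.
words = [w.lower() for sentence in data for w in re.findall(r"\w+", sentence)]

counted_words = Counter(words)

# most_common(n) returns the n highest-count (word, count) pairs.
top_words = counted_words.most_common(4)

# Unzip into two tuples that can be used as x and y for a bar plot.
labels, values = zip(*top_words)
print(top_words)
```

The labels and values tuples can then be passed as x and y to sns.barplot in place of the dataframe columns.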

