Unexpected output when creating a list of frequent words. How do I get the top 10 most frequent words for a given class?

Question

I am trying to get the top 10 most frequent words per class in my dataset. I have the following Python code, but I do not understand the output, why it occurs, or how to correct it.

Below is the dataset I am using (df):

User    Post    Label
0   Nicholas Wyman  Exploring in this months Talent Management HR...    Recruitment
1   Nicholas Wyman  I count myself fortunate to have spent time wi...   Career
2   Nicholas Wyman  This years National Apprenticeship Week comes ...   Recruitment
3   Nicholas Wyman  How will your company tap into workers as a co...   Wellbeing
4   Nicholas Wyman  The momentum for Modern Apprenticeships is bui...   Recruitment

This is the code I am using:

#Import dataset
df = pd.read_csv("Folds1345.csv", engine='python',encoding='latin-1')

#Get classes
classes = df['Label'].unique()
classes = classes.tolist()

#Check each class and produce top 10 words
for i in classes:
  print(i)
  df2=df.loc[df['Label'] == i, 'Post']
  df2 = str(remove_stopwords(df['Post']))
  from collections import Counter
  Frequent = Counter(" ".join(df2).split()).most_common(10)
  print(Frequent)

And this is the output:

Recruitment
[("'", 1213698), (',', 606859), ('e', 507474), ('i', 321003), ('a', 311593), ('n', 303956), ('t', 296568), ('s', 290978), ('r', 276802), ('o', 261175)]
Career
[("'", 1213698), (',', 606859), ('e', 507474), ('i', 321003), ('a', 311593), ('n', 303956), ('t', 296568), ('s', 290978), ('r', 276802), ('o', 261175)]
Wellbeing
[("'", 1213698), (',', 606859), ('e', 507474), ('i', 321003), ('a', 311593), ('n', 303956), ('t', 296568), ('s', 290978), ('r', 276802), ('o', 261175)]
Rewards
[("'", 1213698), (',', 606859), ('e', 507474), ('i', 321003), ('a', 311593), ('n', 303956), ('t', 296568), ('s', 290978), ('r', 276802), ('o', 261175)]
Technology
[("'", 1213698), (',', 606859), ('e', 507474), ('i', 321003), ('a', 311593), ('n', 303956), ('t', 296568), ('s', 290978), ('r', 276802), ('o', 261175)]
Learning
[("'", 1213698), (',', 606859), ('e', 507474), ('i', 321003), ('a', 311593), ('n', 303956), ('t', 296568), ('s', 290978), ('r', 276802), ('o', 261175)]
HR System
[("'", 1213698), (',', 606859), ('e', 507474), ('i', 321003), ('a', 311593), ('n', 303956), ('t', 296568), ('s', 290978), ('r', 276802), ('o', 261175)]
Inclusion
[("'", 1213698), (',', 606859), ('e', 507474), ('i', 321003), ('a', 311593), ('n', 303956), ('t', 296568), ('s', 290978), ('r', 276802), ('o', 261175)]
Diversity
[("'", 1213698), (',', 606859), ('e', 507474), ('i', 321003), ('a', 311593), ('n', 303956), ('t', 296568), ('s', 290978), ('r', 276802), ('o', 261175)]

It seems to be looking at individual letters rather than words, and searching the entire dataset rather than just the posts with the chosen label, but I cannot work out why.

Answer 1

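import pandas as pd
from collections import Counter
#Assumption: remove_stopwords is the gensim function, which takes a single string
from gensim.parsing.preprocessing import remove_stopwords

#Import dataset
df = pd.read_csv("Folds1345.csv", engine='python',encoding='latin-1')

#Get classes
classes = df['Label'].unique().tolist()

#Check each class and produce top 10 words
for i in classes:
  print(i)
  #Keep only the posts with the current label
  df2 = df.loc[df['Label'] == i, 'Post']
  #Remove stopwords from each post individually
  df2 = df2.apply(lambda x: remove_stopwords(str(x)))
  #Join the posts into one string and split on whitespace to get words
  list_words = ' '.join(df2.to_list()).split()
  Frequent = Counter(list_words).most_common(10)
  print(Frequent)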

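EDIT: Your df2 is first a pandas Series and then a string. The second assignment overwrites the filtered posts with str(...) of a result computed from df['Post'], that is, from the whole dataset, which is why every label prints the same counts. Calling " ".join on a string puts a space between every character, so Counter ends up counting characters instead of words. I am not sure which remove_stopwords function you are using; I guess it is the one from gensim. I adapted the code accordingly.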
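To see the character-counting effect in isolation, here is a small sketch. The two sample posts are made up, and str() of a list is only a stand-in for whatever str(remove_stopwords(df['Post'])) actually returns in the question.

```python
from collections import Counter

# Made-up sample posts, used only for illustration
posts = ["talent management", "modern apprenticeships"]

# Hypothetical stand-in for str(remove_stopwords(df['Post'])): one big string
text = str(posts)

# " ".join on a string iterates over its characters, so split() yields single characters
chars = " ".join(text).split()
char_counts = Counter(chars).most_common(3)

# " ".join on a list of strings keeps each post intact, so split() yields words
words = " ".join(posts).split()
word_counts = Counter(words).most_common(3)

print(char_counts)
print(word_counts)
```

The brackets, quotes and commas that str() adds to the list are counted as well, which would be consistent with "'" and "," leading the output in the question.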

EDIT2: This time it should work.
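As an alternative to looping over the classes and filtering each time, the counting can be done in a single pass over the posts. This is only a sketch: top_words_per_label and the small STOPWORDS set are hypothetical names introduced here, not part of pandas or gensim, and the sample data is made up.

```python
from collections import Counter, defaultdict

# Hypothetical stopword set for illustration; swap in a real list (e.g. gensim's) as needed
STOPWORDS = {"the", "a", "is", "for", "in", "to", "of", "and"}

def top_words_per_label(posts, labels, n=10, stopwords=STOPWORDS):
    """Return {label: [(word, count), ...]} with the n most common words per label."""
    counters = defaultdict(Counter)
    for post, label in zip(posts, labels):
        # str() avoids an AttributeError on non-string cells
        # (a missing value would then be counted as the word "nan")
        words = str(post).lower().split()
        counters[label].update(w for w in words if w not in stopwords)
    return {label: counter.most_common(n) for label, counter in counters.items()}

# Tiny made-up sample standing in for df['Post'] and df['Label']
sample_posts = [
    "Hiring is hard",
    "Hiring apprentices for the trades",
    "Sleep is the key to wellbeing",
]
sample_labels = ["Recruitment", "Recruitment", "Wellbeing"]

result = top_words_per_label(sample_posts, sample_labels, n=2)
for label, top in result.items():
    print(label, top)
```

Since a pandas Series is iterable, the call for the data frame in the question would be top_words_per_label(df['Post'], df['Label']). Note that this sketch lowercases the words and does not strip punctuation, so "HR" and "hr" are merged while "week" and "week." are counted separately.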

Answer 1: accepted, 2022-05-04 08:09:15
