Get the average score of the most common (frequent) words in a DataFrame

Question

I am trying to get the average score of the most common words in a DataFrame. At the moment, my DataFrame has this format.

sentence            |    score
"Sam I am Sam"      |      10
"I am Sam"          |      5
"Paul is great Sam" |      5
"I am great"        |      0 
"Sam Sam Sam"       |      15

I managed to get the most frequent words with the code below. It cleans my DataFrame and removes all stop words, which gives me this Series.

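from collections import Counter

import nltk
import pandas as pd
from nltk.corpus import stopwords

nltk.download('stopwords')
stop = set(stopwords.words('english'))  # assumed: English stop words; `stop` was not defined in the original snippet

df_text = df[['sentence', 'score']].copy()  # copy so the assignments below do not trigger SettingWithCopyWarning
df_text['sentence'] = df_text['sentence'].replace("[a-zA-Z0-9]{14}|rt|[0-9]", '', regex=True)
df_text['sentence'] = df_text['sentence'].apply(lambda x: ' '.join(word for word in x.split() if word not in stop))
top_words = pd.Series(' '.join(df_text['sentence']).lower().split()).value_counts()[:25]

Words     |    Freq
Sam       |     7
I         |     3
Am        |     3
Great     |     2
is        |     1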

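I know that groupby().mean() is an important function I will need here, but I don't know how to bring the score column into the calculation. This is the ideal output I would like to get. I have written out the arithmetic to show the logic behind each average.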

Words     |    Avg
Sam       |     35/7 = 5
I         |     15/3 = 5
Am        |     15/3 = 5
Great     |     5/2 = 2.5
is        |     5/1 = 5

Answer 1

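I'll skip the data-cleaning part (such as stop-word removal), except to say that you really should use nltk.word_tokenize instead of split(). In particular, it is then up to you to remove the quotation marks.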

df['words'] = df['sentence'].apply(nltk.word_tokenize)

Once the words are extracted, count them and combine the counts with the scores:

word_counts = pd.concat([df[['score']],
                         df['words'].apply(Counter).apply(pd.Series)], 
                        axis=1)

Now compute the weighted sums and divide them by the word counts:

ws = word_counts.notnull().mul(word_counts['score'], axis=0).sum() \
                                               / word_counts.sum()
#score    1.0
#``       7.0
#Sam      5.0
#I        5.0
#am       5.0
#''       7.0
#Paul     5.0
#is       5.0
#great    2.5

Finally, drop the first row, which was included only for convenience:

del ws['score']
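
Since the question mentions groupby, here is a minimal sketch of the same calculation done with explode and groupby instead of the Counter table. It is not part of the original answer. It assumes plain split() tokenization with no cleaning, and the names tokens, occurrences, score_sums and avg are made up for illustration. Each word's total is the sum of the scores of the sentences that contain it (each sentence counted once), divided by the word's total number of occurrences, which is the arithmetic shown in the question.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "sentence": ["Sam I am Sam", "I am Sam", "Paul is great Sam",
                 "I am great", "Sam Sam Sam"],
    "score": [10, 5, 5, 0, 15],
})

# One row per word occurrence; the original row index is kept (and repeated)
tokens = df.assign(word=df["sentence"].str.split()).explode("word")

# Total number of occurrences of each word across all sentences
occurrences = tokens.groupby("word").size()

# Sum of scores, counting each sentence only once per distinct word
score_sums = (
    tokens.reset_index()                      # original row index becomes the "index" column
          .drop_duplicates(["index", "word"])
          .groupby("word")["score"]
          .sum()
)

# Both Series are indexed by word, so the division aligns on the word
avg = score_sums / occurrences
print(avg)
```

Splitting the work into two groupby calls keeps the "once per sentence" rule for the numerator separate from the "every occurrence" rule for the denominator.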

Answer 2

Assuming your data is stored in tabular form, this should work:

import pandas as pd
from collections import Counter

df = pd.read_csv('data.csv')
# Total number of occurrences of each word across all sentences
cnt = Counter(word for sen in df.sentence.values for word in sen.split())

for item in cnt:
    # Sum the scores of the sentences that contain this word (once per sentence)
    tot_score = 0
    for _, row in df.iterrows():
        if item in row['sentence'].split():
            tot_score += row['score']
    # Every key in cnt has a count of at least 1, so the division is safe
    print(item, tot_score / cnt[item])
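
The loop above rescans the whole DataFrame for every distinct word. As a rough sketch of the same idea in a single pass, the version below accumulates the counts and the score sums together. It is an illustration rather than part of the original answer: the rows list stands in for the CSV file, and the names rows, counts, score_sums and averages are hypothetical.

```python
from collections import Counter, defaultdict

# Sample data copied from the question, as (sentence, score) pairs
rows = [
    ("Sam I am Sam", 10),
    ("I am Sam", 5),
    ("Paul is great Sam", 5),
    ("I am great", 0),
    ("Sam Sam Sam", 15),
]

counts = Counter()             # total occurrences of each word
score_sums = defaultdict(int)  # summed scores of the sentences containing each word

for sentence, score in rows:
    words = sentence.split()
    counts.update(words)       # every occurrence counts toward the denominator
    for word in set(words):    # each sentence contributes its score once per distinct word
        score_sums[word] += score

averages = {word: score_sums[word] / counts[word] for word in counts}

for word, avg in averages.items():
    print(word, avg)
```

With a real DataFrame, the pairs could be produced with zip(df['sentence'], df['score']).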
