
Get the average scores for the most common (frequent) words in a dataframe

I am trying to get the average score for each of the most common words in my dataframe. Currently my dataframe has this format:

sentence            |    score
"Sam I am Sam"      |      10
"I am Sam"          |      5
"Paul is great Sam" |      5
"I am great"        |      0 
"Sam Sam Sam"       |      15

I managed to get the most common words using the snippet below, which cleans up the dataframe and removes all stop words. It yields the series shown after the code.

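import nltk
import pandas as pd
from collections import Counter
from nltk.corpus import stopwords

nltk.download('stopwords')
stop = set(stopwords.words('english'))  # assumption: `stop` was undefined in the original; English stop words assumed

df_text = df[['sentence', 'score']].copy()  # copy so the assignments below do not write into a view of df
df_text['sentence'] = df_text['sentence'].replace("[a-zA-Z0-9]{14}|rt|[0-9]", '', regex=True)
df_text['sentence'] = df_text['sentence'].apply(lambda x: ' '.join(word for word in x.split() if word not in stop))
top_words = pd.Series(' '.join(df_text['sentence']).lower().split()).value_counts()[:25]

Words     |    Freq
Sam       |     7
I         |     3
Am        |     3
Great     |     2
is        |     1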

I understand that groupby().mean() is an important function I would need to use, but I don't understand how to bring in the score column. This is the ideal output I am trying to get; I have shown the math to explain how I arrived at the averages.

Words     |    Avg
Sam       |     35/7 = 5
I         |     15/3 = 5
Am        |     15/3 = 5
Great     |     5/2 = 2.5
is        |     5/1 = 5

I will skip the data cleaning part (such as stopword removal), except that you really should use nltk.word_tokenize instead of split(). In particular, it would be your responsibility to eliminate the quotes.

df['words'] = df['sentence'].apply(nltk.word_tokenize)
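
A minimal sketch of the quote clean-up mentioned above. It assumes the tokenizer turns the surrounding double quotes into `` and '' tokens, as the output further down suggests; the token list here is a hypothetical, hand-written example rather than real tokenizer output.

```python
# Hypothetical token list, shaped like what a tokenizer might return
# for the quoted sentence "Sam I am Sam" (assumption, written by hand).
tokens = ["``", "Sam", "I", "am", "Sam", "''"]

# Keep only purely alphabetic tokens; this drops the quote tokens
# along with any other punctuation-only tokens.
words = [tok for tok in tokens if tok.isalpha()]
print(words)  # ['Sam', 'I', 'am', 'Sam']
```

Filtering with str.isalpha() is a blunt choice: it would also drop tokens that contain digits or apostrophes, so a narrower filter may be preferable if those matter.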

Once the words are extracted, count them and combine with the scores:

word_counts = pd.concat([df[['score']],
                         df['words'].apply(Counter).apply(pd.Series)], 
                        axis=1)

Now, calculate the weighted sums (for each word, the total score of the sentences that contain it, divided by the word's total count):

ws = word_counts.notnull().mul(word_counts['score'], axis=0).sum() \
                                               / word_counts.sum()
#score    1.0
#``       7.0
#Sam      5.0
#I        5.0
#am       5.0
#''       7.0
#Paul     5.0
#is       5.0
#great    2.5

Finally, eliminate the first row that was included only for convenience:

del(ws['score'])
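
As a sanity check, the same arithmetic can be sketched without pandas. This is only an illustration under stated assumptions: the rows are copied from the question with the quotes already removed, plain split() stands in for the tokenizer, and the variable names are made up for the example.

```python
from collections import Counter, defaultdict

# Sample data copied from the question (quotes already stripped).
rows = [
    ("Sam I am Sam", 10),
    ("I am Sam", 5),
    ("Paul is great Sam", 5),
    ("I am great", 0),
    ("Sam Sam Sam", 15),
]

total_count = Counter()       # occurrences of each word over all sentences
score_sum = defaultdict(int)  # summed score of the sentences containing the word

for sentence, score in rows:
    words = sentence.split()
    total_count.update(words)
    # Each sentence contributes its score once per distinct word it contains.
    for word in set(words):
        score_sum[word] += score

averages = {word: score_sum[word] / total_count[word] for word in total_count}
print(averages["Sam"], averages["great"])  # 5.0 2.5
```

For "Sam" this reproduces 35/7 = 5, and for "great" 5/2 = 2.5, matching the desired output in the question.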

Assuming your data is in a tabular format, this should work:

import pandas as pd
from collections import Counter

df = pd.read_csv('data.csv')
# Total number of occurrences of each word across all sentences
cnt = Counter(word for sen in df['sentence'] for word in sen.split())

for item, count in cnt.items():
    # Sum the scores of every sentence that contains the word at least once
    tot_score = 0
    for _, row in df.iterrows():
        if item in row['sentence'].split():
            tot_score += row['score']
    # Every key in cnt has a count of at least 1, so the division is safe
    print(item, tot_score / count)
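
The nested loop above rescans the whole dataframe for every word. A possible alternative, sketched here with explode and groupby, computes the same ratio in one pass. The inline dataframe is an assumption that stands in for data.csv, and the intermediate names (tokens, per_sentence, score_sums) are made up for this example.

```python
import pandas as pd

# Sample data from the question, built inline instead of reading data.csv.
df = pd.DataFrame({
    "sentence": ["Sam I am Sam", "I am Sam", "Paul is great Sam",
                 "I am great", "Sam Sam Sam"],
    "score": [10, 5, 5, 0, 15],
})

# One row per word occurrence; explode() keeps the original row index.
tokens = df.assign(word=df["sentence"].str.split()).explode("word")

# Total number of occurrences of each word across all sentences.
counts = tokens.groupby("word").size()

# Count each sentence's score once per distinct word it contains.
per_sentence = tokens.reset_index().drop_duplicates(["index", "word"])
score_sums = per_sentence.groupby("word")["score"].sum()

avg = (score_sums / counts).sort_values(ascending=False)
print(avg)
```

Dropping duplicate (row, word) pairs before summing mirrors the `if item in ...split()` test in the loop, which adds a sentence's score only once no matter how often the word repeats in it.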
