Sum of the values in specific rows in dataframe

I have a dataframe called 'test' (built as `df` in the code below) like this:

import pandas as pd
from collections import Counter
from nltk import ngrams

data = [['john tom hello text shine bright', 10], ['random text hello text shine bright', 15], ['random text hello text shine bright juli', 14], 
       ['random text hello great shine bright', 15], ['random text hello great shine bright juli', 14]]

df = pd.DataFrame(data, columns=['Text', 'Value'])
df
Text                                       Value
john tom hello text shine bright              34
random text hello text shine bright           42
random text hello text shine bright juli      42
random text hello great shine bright          42
random text hello great shine bright juli     42

I have the following code, which looks for the most common 4-word phrases:

vals_df_1 = [y for x in df['Text'] for y in x.split()]
c_fourgrams = Counter([' '.join(y) for x in [4] for y in ngrams(vals_df_1, x)])

df_1_fourgrams = pd.DataFrame({'ngrams': list(c_fourgrams.keys()),
                   'count': list(c_fourgrams.values())})

df_1_fourgrams = df_1_fourgrams.sort_values('count', ascending=False)
df_1_fourgrams = df_1_fourgrams.head()
df_1_fourgrams

The dataframe 'df_1_fourgrams' then looks like this:

ngrams                    count
hello text shine bright       3
shine bright random text      3
bright random text hello      3
random text hello great       2
text shine bright random      2

What I am missing is a sum of the Value column for each phrase. For example, if the phrase 'most common phrase is' is found in 5 rows, then I need to sum all of the values from the Value column in those 5 rows.

The resulting dataframe would look something like this:

ngrams                    count  Value sum
hello text shine bright       3        118
shine bright random text      3        118
bright random text hello      3        118
random text hello great       2         84
text shine bright random      2         84

Is this possible? How could I do this?

You can compute everything with pandas directly from the original dataframe:

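out = (df
 # one list of 4-word phrases per row (ngrams is the nltk import from the question)
 .assign(ngrams=[[' '.join(x) for x in ngrams(s.split(), 4)]
                 for s in df['Text']])
 # one row per phrase, each carrying the Value of its source row
 .explode('ngrams')
 .groupby('ngrams')['Value']
 # count = occurrences of the phrase, sum = total Value over those occurrences
 .agg(['count', 'sum'])
 .sort_values('count', ascending=False)
 .head(5)
)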

Output, using the Value column as defined in the question's code (10, 15, 14, 15, 14):

                          count  sum
ngrams                              
hello text shine bright       3   39
hello great shine bright      2   29
random text hello great       2   29
random text hello text        2   29
text hello great shine        2   29
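
Note that this approach builds the 4-word phrases row by row, so phrases that span two rows are not produced, and a phrase repeated inside one row would add that row's Value once per repeat. If each row should count at most once per phrase, a variant along the following lines may help. It is only a sketch: `row_ngrams` is a hypothetical helper written for this example (plain Python instead of nltk), and the data is the one from the question's code.

```python
import pandas as pd

data = [['john tom hello text shine bright', 10],
        ['random text hello text shine bright', 15],
        ['random text hello text shine bright juli', 14],
        ['random text hello great shine bright', 15],
        ['random text hello great shine bright juli', 14]]
df = pd.DataFrame(data, columns=['Text', 'Value'])


def row_ngrams(text, n=4):
    """Hypothetical helper: distinct n-word phrases found in a single row."""
    words = text.split()
    # the set removes repeats, so a row contributes to each phrase only once
    return sorted({' '.join(words[i:i + n])
                   for i in range(len(words) - n + 1)})


check = (df
 .assign(ngrams=df['Text'].map(row_ngrams))
 .explode('ngrams')
 .groupby('ngrams')['Value']
 .agg(['count', 'sum'])
 # mergesort is stable, so ties keep the alphabetical groupby order
 .sort_values('count', ascending=False, kind='mergesort')
)

top = check.loc['hello text shine bright']
print(top['count'], top['sum'])  # 3 39
```

With this data no phrase repeats within a row, so the counts and sums match the output above; the two versions would only differ on rows that contain the same phrase more than once.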
