Sum of the values in specific rows in dataframe
I have a dataframe called 'df' like this:
import pandas as pd
from collections import Counter
from nltk import ngrams
data = [['john tom hello text shine bright', 10], ['random text hello text shine bright', 15], ['random text hello text shine bright juli', 14],
['random text hello great shine bright', 15], ['random text hello great shine bright juli', 14]]
df = pd.DataFrame(data, columns=['Text', 'Value'])
df
Text | Value |
---|---|
john tom hello text shine bright | 34 |
random text hello text shine bright | 42 |
random text hello text shine bright juli | 42 |
random text hello great shine bright | 42 |
random text hello great shine bright juli | 42 |
I have the following code, which looks for the most common 4-word phrases:
vals_df_1 = [y for x in df['Text'] for y in x.split()]
c_fourgrams = Counter([' '.join(y) for x in [4] for y in ngrams(vals_df_1, x)])
df_1_fourgrams = pd.DataFrame({'ngrams': list(c_fourgrams.keys()),
'count': list(c_fourgrams.values())})
df_1_fourgrams = df_1_fourgrams.sort_values('count', ascending=False)
df_1_fourgrams = df_1_fourgrams.head()
df_1_fourgrams
Then the dataframe 'df_1_fourgrams' looks like this:
ngrams | count |
---|---|
hello text shine bright | 3 |
shine bright random text | 3 |
bright random text hello | 3 |
random text hello great | 2 |
text shine bright random | 2 |
What I am missing is a sum of the Value column for each phrase. If a phrase (say, 'most common phrase is') is found in 5 rows, then I need to sum the values from the Value column in those 5 rows.
The resulting dataframe would look something like this:
ngrams | count | Value sum |
---|---|---|
hello text shine bright | 3 | 118 |
shine bright random text | 3 | 118 |
bright random text hello | 3 | 118 |
random text hello great | 2 | 84 |
text shine bright random | 2 | 84 |
Is this possible? How could I do this?
You can achieve everything with pandas from the original dataframe:
out = (df
       # build the list of 4-word phrases found in each row
       .assign(ngrams=[[' '.join(x) for x in ngrams(s.split(), 4)]
                       for s in df['Text']])
       # one row per phrase, keeping that row's Value
       .explode('ngrams')
       .groupby('ngrams')['Value']
       .agg(['count', 'sum'])
       .sort_values('count', ascending=False)
       .head(5)
      )
Output:
                          count  sum
ngrams
hello text shine bright       3   39
hello great shine bright      2   29
random text hello great       2   29
random text hello text        2   29
text hello great shine        2   29
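To check these numbers by hand, here is a minimal sketch of the same count-and-sum logic in plain Python, without pandas or nltk. The helper names `row_ngrams` and `ngram_totals` are made up for this illustration and are not part of any library. Like the chain above, it builds phrases inside each row only, so a phrase never spans two rows.

```python
from collections import defaultdict

data = [
    ['john tom hello text shine bright', 10],
    ['random text hello text shine bright', 15],
    ['random text hello text shine bright juli', 14],
    ['random text hello great shine bright', 15],
    ['random text hello great shine bright juli', 14],
]


def row_ngrams(text, n=4):
    """Return the n-word phrases found inside a single row of text."""
    words = text.split()
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]


def ngram_totals(rows, n=4):
    """Map each phrase to [count, sum of Value].

    Every occurrence adds 1 to the count and the row's Value to the sum,
    which mirrors explode + groupby + agg(['count', 'sum']).
    """
    totals = defaultdict(lambda: [0, 0])
    for text, value in rows:
        for phrase in row_ngrams(text, n):
            totals[phrase][0] += 1
            totals[phrase][1] += value
    return dict(totals)


totals = ngram_totals(data)

# Five most frequent phrases; the order among ties may differ from pandas.
top = sorted(totals.items(), key=lambda kv: kv[1][0], reverse=True)[:5]
for phrase, (count, value_sum) in top:
    print(phrase, count, value_sum)
```

For 'hello text shine bright' this gives a count of 3 and a sum of 10 + 15 + 14 = 39, matching the first row of the output above. If a phrase can repeat within one row and that row's Value should only be added once, wrapping `row_ngrams(text, n)` in `set(...)` inside the loop would be one way to do it.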