
Count the frequency of 2 words combination in all the rows of a column

I want to count the frequency of 2-word combinations across all the rows of a column.

I have a table with two columns: the first holds a sentence, while the other holds the bigram tokenization of that sentence.

Sentence                           words
'beautiful day suffered through '  'beautiful day'
'beautiful day suffered through '  'day suffered'
'beautiful day suffered through '  'suffered through'
'cannot hold back tears '          'cannot hold'
'cannot hold back tears '          'hold back'
'cannot hold back tears '          'back tears'
'ash back tears beautiful day '    'ash back'
'ash back tears beautiful day '    'back tears'
'ash back tears beautiful day '    'tears beautiful'
'ash back tears beautiful day '    'beautiful day'

My desired output is a column counting the frequency of each bigram across the whole df['Sentence'] column. Something like this:

Sentence                           Words               Total
'beautiful day suffered through '  'beautiful day'     2
'beautiful day suffered through '  'day suffered'      1
'beautiful day suffered through '  'suffered through'  1
'cannot hold back tears '          'cannot hold'       1
'cannot hold back tears '          'hold back'         1
'cannot hold back tears '          'back tears'        2
'ash back tears beautiful day '    'ash back'          1
'ash back tears beautiful day '    'back tears'        2
'ash back tears beautiful day '    'tears beautiful'   1
'ash back tears beautiful day '    'beautiful day'     2

and so on.

The code I have tried repeats the first matched frequency until the end of the sentence:

df.Sentence.str.count('|'.join(df.words.tolist()))

So it is not what I am looking for, and it also takes a very long time, as my original df is much larger.

Is there any alternative, or any function in NLTK or another library?

I suggest:

  • Start by removing the quotes and the whitespace at the beginning and end of both Sentence and words:
data = data.apply(lambda x: x.str.replace("'", ""))
data["Sentence"] = data["Sentence"].str.strip()
data["words"] = data["words"].str.strip()
  • Then cast Sentence and words to string objects:
data = data.astype({"Sentence":str, "words": str})
print(data)

#Output
                          Sentence            words
0   beautiful day suffered through     beautiful day
1   beautiful day suffered through      day suffered
2   beautiful day suffered through  suffered through
3           cannot hold back tears       cannot hold
4           cannot hold back tears         hold back
5           cannot hold back tears        back tears
6     ash back tears beautiful day          ash back
7     ash back tears beautiful day        back tears
8     ash back tears beautiful day   tears beautiful
9     ash back tears beautiful day     beautiful day
  • Count the occurrences of the given words within the sentence on the same row, and store the result in a column, e.g. words_occur:
def words_in_sent(row):
    return row["Sentence"].count(row["words"])
data["words_occur"] = data.apply(words_in_sent, axis=1)
  • Finally, group by words and sum up the occurrences:
data["total"] = data["words_occur"].groupby(data["words"]).transform("sum")
print(data)

Result

                          Sentence          words    words_occur total
0   beautiful day suffered through     beautiful day           1     2
1   beautiful day suffered through      day suffered           1     1
2   beautiful day suffered through  suffered through           1     1
3           cannot hold back tears       cannot hold           1     1
4           cannot hold back tears         hold back           1     1
5           cannot hold back tears        back tears           1     2
6     ash back tears beautiful day          ash back           1     1
7     ash back tears beautiful day        back tears           1     2
8     ash back tears beautiful day   tears beautiful           1     1
9     ash back tears beautiful day     beautiful day           1     2
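The steps above can be condensed into one runnable sketch (a minimal example using the question's sample data, with the quotes and whitespace already stripped):

```python
import pandas as pd

# Sample data matching the question's table
data = pd.DataFrame({
    "Sentence": ["beautiful day suffered through"] * 3
              + ["cannot hold back tears"] * 3
              + ["ash back tears beautiful day"] * 4,
    "words": ["beautiful day", "day suffered", "suffered through",
              "cannot hold", "hold back", "back tears",
              "ash back", "back tears", "tears beautiful", "beautiful day"],
})

# Count each bigram inside its own sentence (row-wise substring count)
data["words_occur"] = data.apply(
    lambda row: row["Sentence"].count(row["words"]), axis=1
)

# Sum those per-row counts per bigram and broadcast back onto every row
data["total"] = data.groupby("words")["words_occur"].transform("sum")
```

Because `transform("sum")` returns a result aligned to the original index, the per-bigram totals land directly next to each row without a merge.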

The way I understand it, you want a bigram count as contained in each unique sentence. The answer for that already exists in the words column; value_counts() is used to deliver it.

df.merge(df['words'].value_counts(), how='left', left_on='words', right_index=True, suffixes=(None,'_total')) 

                           Sentence             words  words_total
0  beautiful day suffered through       beautiful day            2
1  beautiful day suffered through        day suffered            1
2  beautiful day suffered through    suffered through            1
3          cannot hold back tears         cannot hold            1
4          cannot hold back tears           hold back            1
5          cannot hold back tears          back tears            2
6    ash back tears beautiful day            ash back            1
7    ash back tears beautiful day          back tears            2
8    ash back tears beautiful day     tears beautiful            1
9    ash back tears beautiful day       beautiful day            2
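If only the total column is needed, mapping `value_counts()` back onto the column is a lighter alternative to the merge (a sketch under the same assumption: each row's bigram occurs once in its own sentence):

```python
import pandas as pd

# Data as in the question, with quotes and whitespace already stripped
df = pd.DataFrame({
    "Sentence": ["beautiful day suffered through"] * 3
              + ["cannot hold back tears"] * 3
              + ["ash back tears beautiful day"] * 4,
    "words": ["beautiful day", "day suffered", "suffered through",
              "cannot hold", "hold back", "back tears",
              "ash back", "back tears", "tears beautiful", "beautiful day"],
})

# value_counts() tallies how many rows carry each bigram;
# map() broadcasts those totals back onto the original rows
df["words_total"] = df["words"].map(df["words"].value_counts())
```

This avoids the index alignment bookkeeping of the merge, at the cost of not keeping a separate per-row occurrence column.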
