Count the frequency of 2-word combinations in all the rows of a column
I want to count the frequency of each 2-word combination across all the rows of a column.
I have a table with two columns: the first holds a sentence, while the other holds one bigram from the tokenization of that sentence.
Sentence | words |
---|---|
'beautiful day suffered through ' | 'beautiful day' |
'beautiful day suffered through ' | 'day suffered' |
'beautiful day suffered through ' | 'suffered through' |
'cannot hold back tears ' | 'cannot hold' |
'cannot hold back tears ' | 'hold back' |
'cannot hold back tears ' | 'back tears' |
'ash back tears beautiful day ' | 'ash back' |
'ash back tears beautiful day ' | 'back tears' |
'ash back tears beautiful day ' | 'tears beautiful' |
'ash back tears beautiful day ' | 'beautiful day' |
My desired output is a column counting the frequency of each bigram across the whole df['Sentence'] column. Something like this:
Sentence | Words | Total |
---|---|---|
'beautiful day suffered through ' | 'beautiful day' | 2 |
'beautiful day suffered through ' | 'day suffered' | 1 |
'beautiful day suffered through ' | 'suffered through' | 1 |
'cannot hold back tears ' | 'cannot hold' | 1 |
'cannot hold back tears ' | 'hold back' | 1 |
'cannot hold back tears ' | 'back tears' | 2 |
'ash back tears beautiful day ' | 'ash back' | 1 |
'ash back tears beautiful day ' | 'back tears' | 2 |
'ash back tears beautiful day ' | 'tears beautiful' | 1 |
'ash back tears beautiful day ' | 'beautiful day' | 2 |
and so on.
The code I have tried repeats the first frequency for every row of the same sentence:
df.Sentence.str.count('|'.join(df.words.tolist()))
So it is not what I am looking for, and it also takes a very long time, since my original df is much larger.
Is there an alternative, or a function in NLTK or any other library?
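For anyone who wants to reproduce the problem, the sample frame above can be built like this (column names as in the question):

```python
import pandas as pd

# Sample data from the question: one row per (sentence, bigram) pair.
df = pd.DataFrame({
    "Sentence": ["beautiful day suffered through"] * 3
                + ["cannot hold back tears"] * 3
                + ["ash back tears beautiful day"] * 4,
    "words": ["beautiful day", "day suffered", "suffered through",
              "cannot hold", "hold back", "back tears",
              "ash back", "back tears", "tears beautiful", "beautiful day"],
})
```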
I suggest:
First, remove the quotes and the leading/trailing spaces from Sentence and words:
data = data.apply(lambda x: x.str.replace("'", ""))
data["Sentence"] = data["Sentence"].str.strip()
data["words"] = data["words"].str.strip()
Then set Sentence and words as string objects:
data = data.astype({"Sentence": str, "words": str})
print(data)
#Output
Sentence words
0 beautiful day suffered through beautiful day
1 beautiful day suffered through day suffered
2 beautiful day suffered through suffered through
3 cannot hold back tears cannot hold
4 cannot hold back tears hold back
5 cannot hold back tears back tears
6 ash back tears beautiful day ash back
7 ash back tears beautiful day back tears
8 ash back tears beautiful day tears beautiful
9 ash back tears beautiful day beautiful day
Next, count the occurrences of each row's bigram within the sentence of the same row, and store them in a column, e.g. words_occur:
def words_in_sent(row):
    return row["Sentence"].count(row["words"])
data["words_occur"] = data.apply(words_in_sent, axis=1)
Finally, group by words and sum up the occurrences:
data["total"] = data["words_occur"].groupby(data["words"]).transform("sum")
print(data)
Result
Sentence words words_occur total
0 beautiful day suffered through beautiful day 1 2
1 beautiful day suffered through day suffered 1 1
2 beautiful day suffered through suffered through 1 1
3 cannot hold back tears cannot hold 1 1
4 cannot hold back tears hold back 1 1
5 cannot hold back tears back tears 1 2
6 ash back tears beautiful day ash back 1 1
7 ash back tears beautiful day back tears 1 2
8 ash back tears beautiful day tears beautiful 1 1
9 ash back tears beautiful day beautiful day 1 2
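In this particular data every row contributes exactly one occurrence (words_occur is always 1), so when that holds the row-wise apply can be skipped entirely and the total computed with a single vectorized map, which scales much better on a large frame. A sketch, assuming the cleaned data above:

```python
import pandas as pd

# Cleaned data, as produced by the steps above.
data = pd.DataFrame({
    "Sentence": ["beautiful day suffered through"] * 3
                + ["cannot hold back tears"] * 3
                + ["ash back tears beautiful day"] * 4,
    "words": ["beautiful day", "day suffered", "suffered through",
              "cannot hold", "hold back", "back tears",
              "ash back", "back tears", "tears beautiful", "beautiful day"],
})

# Each bigram appears once per row, so its total frequency is simply the
# number of rows that carry it; map the value counts back onto the column.
data["total"] = data["words"].map(data["words"].value_counts())
```

If a bigram could repeat inside a single sentence, the words_occur step above is still needed; the map shortcut only applies when each row stands for one occurrence.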
The way I understand it is that you want a bigram count as contained in each unique sentence. That answer already exists in the words column, and value_counts() is used to deliver it:
df.merge(df['words'].value_counts(), how='left', left_on='words', right_index=True, suffixes=(None,'_total'))
Sentence words words_total
0 beautiful day suffered through beautiful day 2
1 beautiful day suffered through day suffered 1
2 beautiful day suffered through suffered through 1
3 cannot hold back tears cannot hold 1
4 cannot hold back tears hold back 1
5 cannot hold back tears back tears 2
6 ash back tears beautiful day ash back 1
7 ash back tears beautiful day back tears 2
8 ash back tears beautiful day tears beautiful 1
9 ash back tears beautiful day beautiful day 2
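The question also asks about NLTK. The bigram column itself can be rebuilt from the raw sentences without any per-row string matching: plain `zip` over shifted token lists is equivalent to `nltk.bigrams`, and `collections.Counter` gives the totals in one pass. A stdlib-only sketch over the three unique sentences (the helper name `bigrams` is illustrative):

```python
from collections import Counter

sentences = [
    "beautiful day suffered through",
    "cannot hold back tears",
    "ash back tears beautiful day",
]

def bigrams(sentence):
    # Pair each token with its successor, mirroring nltk.bigrams.
    tokens = sentence.split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

# Count every bigram across all sentences in one pass.
counts = Counter(bg for s in sentences for bg in bigrams(s))
```

The resulting Counter can then be mapped back onto the DataFrame's words column to produce the Total column.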