Python Pandas NLTK: Show Frequency of Common Phrases (ngrams) From Text Field in Dataframe Using BigramCollocationFinder
I have the following sample tokenized data frame:
No category problem_definition_stopwords
175 2521 ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420']
211 1438 ['galley', 'work', 'table', 'stuck']
912 2698 ['cloth', 'stuck']
572 2521 ['stuck', 'coffee']
I ran the code below successfully to extract ngram phrases.
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_documents(df['problem_definition_stopwords'])
# only bigrams that appear 1+ times
finder.apply_freq_filter(1)
# return the 10 n-grams with the highest PMI
finder.nbest(bigram_measures.pmi, 10)
The top 10 results by PMI are shown below:
[('brewing', 'properly'), ('galley', 'work'), ('maker', 'brewing'), ('properly', '2'), ('work', 'table'), ('coffee', 'maker'), ('2', '420'), ('cloth', 'stuck'), ('table', 'stuck'), ('420', '420')]
I want the above result to appear in a data frame containing frequency counts showing how often those bigrams occurred.
Sample desired output:
ngram frequency
'brewing', 'properly' 1
'galley', 'work' 1
'maker', 'brewing' 1
'properly', '2' 1
... ...
How do I do the above in Python?
This should do it...
First, set up your dataset (or a similar one):
import pandas as pd
import nltk.collocations
from nltk.collocations import BigramCollocationFinder
from nltk import ngrams
from collections import Counter
s = pd.Series(
    [
        ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420'],
        ['galley', 'work', 'table', 'stuck'],
        ['cloth', 'stuck'],
        ['stuck', 'coffee'],
    ]
)
finder = BigramCollocationFinder.from_documents(s.values)
bigram_measures = nltk.collocations.BigramAssocMeasures()
# only bigrams that appear 1+ times
finder.apply_freq_filter(1)
# return the 10 n-grams with the highest PMI
result = finder.nbest(bigram_measures.pmi, 10)
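As an optional aside (not needed for the frequency counts below), if you also want to see the PMI values that nbest ranks by, score_ngrams returns (ngram, score) pairs in descending order of score:
scored = finder.score_ngrams(bigram_measures.pmi)  # list of ((w1, w2), pmi) tuples, highest PMI first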
Use nltk.ngrams to recreate the list of ngrams:
ngram_list = [pair for row in s for pair in ngrams(row, 2)]
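For reference, ngram_list is just a flat list of bigram tuples drawn from every row; on the sample Series above, the first few entries look like this:
ngram_list[:3]
# [('coffee', 'maker'), ('maker', 'brewing'), ('brewing', 'properly')]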
Use collections.Counter to count the number of times each ngram appears across the entire corpus:
counts = Counter(ngram_list).most_common()
Build a DataFrame that looks like what you want:
pd.DataFrame.from_records(counts, columns=['gram', 'count'])
gram count
0 (420, 420) 2
1 (coffee, maker) 1
2 (maker, brewing) 1
3 (brewing, properly) 1
4 (properly, 2) 1
5 (2, 420) 1
6 (galley, work) 1
7 (work, table) 1
8 (table, stuck) 1
9 (cloth, stuck) 1
10 (stuck, coffee) 1
You can then filter to look at only those ngrams produced by your finder.nbest call:
df = pd.DataFrame.from_records(counts, columns=['gram', 'count'])
df[df['gram'].isin(result)]
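As another aside, the finder already stores these counts internally, so a minimal sketch of an alternative (assuming you only need counts for the bigrams the finder saw) is to read them straight from finder.ngram_fd, the frequency distribution that BigramCollocationFinder builds; freq_df here is just an illustrative name:
# ngram_fd is an nltk FreqDist mapping each bigram tuple to its count
freq_df = pd.DataFrame.from_records(list(finder.ngram_fd.items()), columns=['gram', 'count'])
freq_df = freq_df.sort_values('count', ascending=False).reset_index(drop=True)
freq_df[freq_df['gram'].isin(result)]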