
Bigram frequency without word order in Python

I have written a program to find the frequency of words in Python. I am stuck at the point where I need to find the frequency of bigrams without considering word order. That means "in the" should be counted the same as "the in". Code to find bigram frequency:

import nltk
from itertools import combinations
from nltk.collocations import BigramCollocationFinder

txt = open('txt file', 'r')
finder1 = BigramCollocationFinder.from_words(txt.read().split(), window_size=3)
finder1.apply_freq_filter(3)
bigram_measures = nltk.collocations.BigramAssocMeasures()

for k, v in sorted(list(combinations(set(finder1.ngram_fd.items()), 2)), key=lambda t: t[-1], reverse=True)[:10]:
    print(k, v)

This seems like a case where you could use sets as the keys in a Counter. Sets are unordered containers, and Counters are dictionaries specialized for counting occurrences of objects in an iterable. Because dictionary keys must be hashable, the keys need to be frozensets rather than plain sets. It could look something like this:

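from collections import Counter
from string import punctuation as punct

with open('txt file.txt') as txt:
    # Strip punctuation, then split on whitespace into a list of words.
    doc = txt.read().translate(str.maketrans('', '', punct)).split()

c = Counter()

# A frozenset is hashable and unordered, so both word orders share one key.
c.update(frozenset((doc[i], doc[i+1])) for i in range(len(doc) - 1))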

The with statement opens the file and automatically closes it afterwards. The text is stripped of punctuation and read into a list of words separated by whitespace characters (spaces, newlines, etc.). Then the Counter is initialized and counts unordered pairs of adjacent words in the text.
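As a self-contained illustration of the same idea, here is a minimal sketch that works on an in-memory string instead of a file. It is a variation, not the code above: it uses a sorted tuple as the key rather than a frozenset, because a frozenset collapses a repeated pair such as ("the", "the") into a single-element set, while a sorted tuple keeps both words. The function name `unordered_bigram_counts` and the sample sentence are hypothetical, chosen only for this example.

```python
from collections import Counter
from string import punctuation


def unordered_bigram_counts(text):
    """Count adjacent word pairs, ignoring the order of the two words."""
    # Strip ASCII punctuation, then split on whitespace.
    words = text.translate(str.maketrans('', '', punctuation)).split()
    # zip(words, words[1:]) yields each adjacent pair; sorting the pair
    # gives a canonical tuple, so ('in', 'the') and ('the', 'in') coincide.
    return Counter(tuple(sorted(pair)) for pair in zip(words, words[1:]))


# Hypothetical sample text, used only to exercise the function.
sample = "in the house, the in crowd met in the garden"
counts = unordered_bigram_counts(sample)
for pair, n in counts.most_common(3):
    print(pair, n)
```

Counter.most_common(n) returns the n most frequent pairs, which covers the "top 10" step from the question without a manual sort.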
