
How to write an alphabet bigram (aa, ab, bc, cd … zz) frequency analysis counter in Python?

This is my current code, which prints the frequency of each character in the input file.

from collections import defaultdict

counters = defaultdict(int)
with open("input.txt") as content_file:
   content = content_file.read()
   for char in content:
       counters[char] += 1

for letter in counters.keys():
    print letter, (round(counters[letter]*100.00/1234,3)) 

I want it to print the frequency of bigrams made up only of letters (aa, ab, ac … zy, zz), ignoring punctuation. How can I do this?

You can build on the current code to handle pairs as well. Keep track of two characters instead of one by adding another variable, and use a check to skip any pair that contains a non-letter.

from collections import defaultdict

counters = defaultdict(int)
paired_counters = defaultdict(int)
with open("input.txt") as content_file:
   content = content_file.read()
   prev = '' #keeps track of last seen character
   for char in content:
       counters[char] += 1
       if prev and (prev+char).isalpha(): #checks for alphabets.
           paired_counters[prev+char] += 1
       prev = char #assign current char to prev variable for next iteration

# Tip: you can iterate over key/value pairs with .items() in Python 3 or .iteritems() in Python 2.
for letter in counters.keys():
    print letter, (round(counters[letter]*100.00/1234,3))

for pairs, values in paired_counters.iteritems():  # use .items() in Python 3; I'm guessing this is Python 2
    print pairs, values

(Disclaimer: I do not have Python 2 on my system. If there is an issue in the code, let me know.)
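
For readers on Python 3, the same previous-character approach could look roughly like the sketch below. The function name `count_bigrams` and the sample string are made up for illustration, and the percentage is computed from the actual number of bigrams counted rather than a hard-coded 1234.

```python
from collections import defaultdict


def count_bigrams(text):
    """Count adjacent letter pairs, skipping any pair that contains a non-letter."""
    paired_counters = defaultdict(int)
    prev = ''  # last character seen
    for char in text:
        if prev and (prev + char).isalpha():
            paired_counters[prev + char] += 1
        prev = char
    return paired_counters


sample = "ab, abc"  # hypothetical input; in practice read it from input.txt
counts = count_bigrams(sample)
total = sum(counts.values())  # total number of bigrams counted
for pair, value in sorted(counts.items()):
    print(pair, value, round(value * 100.0 / total, 3))
```

Because the check is applied to the pair as a whole, a pair is dropped as soon as either character is a space or punctuation mark, so no bigram spans two words.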

There is a more efficient way of counting bigrams: with a Counter. Start by reading the text (assuming it is not too large):

from collections import Counter
with open("input.txt") as content_file:
   content = content_file.read()

Filter out non-letters:

letters = list(filter(str.isalpha, content))

You should probably convert all letters to lower case too, but that is up to you:

letters = [ch.lower() for ch in letters]  # letters is a list, so lower-case each item

Zip the remaining letters with a copy of themselves shifted by one position, and count the resulting bigrams:

cntr = Counter(zip(letters, letters[1:]))
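
To see what the zip produces, here is a small sketch on a made-up string. Because non-letters are removed before pairing, the last letter of one word ends up paired with the first letter of the next. The second half shows one possible way to avoid that by pairing only within each word; the regular expression there is an assumption of this sketch (ASCII letters only), not part of the answer.

```python
import re
from collections import Counter

content = "Hello, world"  # made-up sample text

# Approach from the answer: drop non-letters, then pair neighbours.
letters = [ch.lower() for ch in content if ch.isalpha()]
cntr = Counter(zip(letters, letters[1:]))
print(cntr[('o', 'w')])  # this pair spans the gap between the two words

# Variant (an assumption): pair only inside each run of ASCII letters.
word_cntr = Counter()
for word in re.findall(r"[a-z]+", content.lower()):
    word_cntr.update(zip(word, word[1:]))
print(word_cntr[('o', 'w')])
```

Which behaviour is right depends on the analysis: the first treats the text as one continuous stream of letters, the second respects word boundaries.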

Normalize the dictionary:

total = sum(cntr.values())  # total number of bigram occurrences, so the frequencies sum to 1
{''.join(k): v / total for k,v in cntr.most_common()}
# -> a dict mapping each bigram string (e.g. 'ow', 'He') to its relative frequency
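
Since the question asks for every pair from aa to zz, one option is to walk the full 26×26 grid and report zero for pairs that never occur. A minimal sketch, assuming lower-case ASCII letters only and a made-up sample text; here `total` is the number of bigram occurrences, so the frequencies sum to 1:

```python
from collections import Counter
from itertools import product
from string import ascii_lowercase

content = "Hello, world"  # made-up sample text
letters = [ch for ch in content.lower() if ch in ascii_lowercase]
cntr = Counter(zip(letters, letters[1:]))
total = sum(cntr.values())  # number of bigram occurrences

# Relative frequency for all 676 pairs; unseen pairs get 0.0.
freqs = {
    a + b: (cntr[(a, b)] / total if total else 0.0)
    for a, b in product(ascii_lowercase, repeat=2)
}
print(len(freqs), freqs['ll'], freqs['zz'])
```

Looking up a missing key in a Counter returns 0 without adding the key, which is why the grid can be filled directly from `cntr`.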

The solution can easily be generalized to trigrams and longer sequences by changing the counter:

cntr = Counter(zip(letters, letters[1:], letters[2:]))
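
The same idea can be wrapped up for any n by zipping n shifted copies of the sequence. A rough sketch; the helper name `count_ngrams` and the sample word are hypothetical:

```python
from collections import Counter


def count_ngrams(letters, n):
    """Count runs of n consecutive items by zipping n shifted copies of letters."""
    shifted = [letters[i:] for i in range(n)]
    return Counter(zip(*shifted))


letters = list("banana")  # made-up sample
print(count_ngrams(letters, 2).most_common(1))
print(count_ngrams(letters, 3).most_common(1))
```

`zip` stops at the shortest input, so a sequence of length m yields m - n + 1 n-grams, and none at all when m is less than n.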

If you're using nltk:

from nltk import ngrams
list(ngrams('hello', n=2))

[out]:

[('h', 'e'), ('e', 'l'), ('l', 'l'), ('l', 'o')]

To do a count:

from collections import Counter
Counter(list(ngrams('hello', n=2)))

If you want a native Python solution, take a look at:
