Finding bigrams in Unicode text with NLTK
I am trying to find the most common bigrams in a Unicode text. This is the code I am using:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import nltk
from nltk.collocations import *
import codecs

line = ""
open_file = codecs.open('s.txt', 'r', encoding='utf-8').read()
for val in open_file:
    line += val.lower()
tokens = line.split()

bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(1)
a = finder.ngram_fd.viewitems()
for i,j in a:
    print i,j
The file s.txt contains the following text: çalışmak naber çösd bfkd
This is the output:
(u'\xe7\xf6sd', u'bfkd') 1
(u'naber', u'\xe7\xf6sd') 1
(u'\xe7al\u0131\u015fmak', u'naber') 1
But I want this format:
çalışmak naber 1
naber çösd 1
çösd bfkd 1
How can I fix this Unicode problem?
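You need to print the elements of the tuple explicitly rather than the whole tuple. In Python 2, printing a tuple shows the repr() of each element, and the repr() of a unicode string escapes non-ASCII characters.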
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import nltk
from nltk.collocations import *
import codecs

line = ""
open_file = codecs.open('s.txt', 'r', encoding='utf-8').read()
for val in open_file:
    line += val.lower()
tokens = line.split()

bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(1)
a = finder.ngram_fd.viewitems()
for i, j in a:
    # i is the (word1, word2) tuple and j is its count; format each part
    # separately so the words are printed as text, not as their repr()
    print("{0} {1} {2}".format(i[0], i[1], j))
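To illustrate why printing the whole tuple produces escapes, here is a minimal sketch. It assumes Python 3, where the built-in ascii() escapes non-ASCII characters much as Python 2's repr() does for unicode strings (without the u prefix). The helper names escaped_view and readable_view are hypothetical and not part of NLTK.

```python
# -*- coding: utf-8 -*-
# Sketch, assuming Python 3: ascii() stands in for Python 2's repr().


def escaped_view(bigram):
    """Roughly what Python 2 shows when the whole tuple is printed."""
    return "(" + ", ".join(ascii(word) for word in bigram) + ")"


def readable_view(bigram, count):
    """Format the tuple's elements one by one, as suggested above."""
    return "{0} {1} {2}".format(bigram[0], bigram[1], count)


if __name__ == "__main__":
    bigram = ("çösd", "bfkd")
    print(escaped_view(bigram))      # escaped form, like the question's output
    print(readable_view(bigram, 1))  # readable form
```

The escaping comes from how the container renders its elements, not from the data itself, so nothing needs to be re-encoded.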
The same formatting applied to the output from the question, saved as test.py:
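# -*- coding: utf-8 -*-
l = [((u'\xe7\xf6sd', u'bfkd'), 1), ((u'naber', u'\xe7\xf6sd'), 1), ((u'\xe7al\u0131\u015fmak', u'naber'), 1)]
for i, j in l:
    # The u prefix keeps the template unicode on Python 2, where a plain str
    # template would try to ASCII-encode the unicode arguments.
    print(u"{0} {1} {2}".format(i[0], i[1], j))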
Running it:
14:58 $ python test.py
çösd bfkd 1
naber çösd 1
çalışmak naber 1
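On Python 3, repr() no longer escapes printable non-ASCII characters, but formatting the elements explicitly is still what gives the requested "word word count" layout. The following is a minimal sketch under stated assumptions: it uses collections.Counter over zip(tokens, tokens[1:]) as a stand-in for BigramCollocationFinder so that it runs without NLTK, and bigram_lines is a hypothetical helper name.

```python
# -*- coding: utf-8 -*-
# Sketch, assuming Python 3 and only the standard library.
from collections import Counter


def bigram_lines(text):
    """Return one 'first second count' line per distinct bigram in text."""
    tokens = text.lower().split()
    # Pair each token with its successor to form the bigrams.
    counts = Counter(zip(tokens, tokens[1:]))
    return ["{0} {1} {2}".format(first, second, count)
            for (first, second), count in counts.items()]


if __name__ == "__main__":
    for line in bigram_lines("çalışmak naber çösd bfkd"):
        print(line)
```

With NLTK installed, the same loop body can be applied to the items of finder.ngram_fd instead of the Counter.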