
Tagging Spanish text with Unicode characters not possible with NLTK?

I'm trying to parse some Spanish sentences that contain non-ASCII characters (mostly accented words, for instance película (film), atención (attention), etc.).

I'm reading the lines from a file encoded in UTF-8. Here is a sample of my script:

# -*- coding: utf-8 -*-

import codecs  # needed for codecs.open() below
import nltk
import sys
from nltk.corpus import cess_esp as cess
from nltk import UnigramTagger as ut
from nltk import BigramTagger as bt

f = codecs.open('spanish_sentences', encoding='utf-8')
results_file = codecs.open('tagging_results', encoding='utf-8', mode='w+')

for line in iter(f):

    output_line =  "Current line contents before tagging->" + str(line.decode('utf-8', 'replace'))
    print output_line
    results_file.write(output_line.encode('utf8'))

    output_line = "Unigram tagger->"
    print output_line
    results_file.write(output_line)

    s = line.decode('utf-8', 'replace')
    output_line = tagger.uni.tag(s.split())
    print output_line
    results_file.write(str(output_line).encode('utf8'))

f.close()
results_file.close()

At this line:

output_line = tagger.uni.tag(s.split())

I'm getting this error:

/usr/local/lib/python2.7/dist-packages/nltk-2.0.4-py2.7.egg/nltk/tag/sequential.py:138: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
return self._context_to_tag.get(context)

Here is some output for a simple sentence:

Current line contents before tagging->tengo una queja y cada que hablo a atención me dejan en la linea media hora y cortan la llamada!!

Unigram tagger->
[(u'tengo', 'vmip1s0'), (u'una', 'di0fs0'), (u'queja', 'ncfs000'), (u'y', 'cc'), (u'cada', 'di0cs0'), (u'que', 'pr0cn000'), (u'hablo', 'vmip1s0'), (u'a', 'sps00'), (u'atenci\xf3n', None), (u'me', 'pp1cs000'), (u'dejan', 'vmip3p0'), (u'en', 'sps00'), (u'la', 'da0fs0'), (u'linea', None), (u'media', 'dn0fs0'), (u'hora', 'ncfs000'), (u'y', 'cc'), (u'cortan', None), (u'la', 'da0fs0'), (u'llamada!!', None)]

If I understood this chapter correctly, the process is right: I decode the line from UTF-8 to Unicode, tag it, and then encode from Unicode back to UTF-8. I don't understand this error.
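For reference, the decode, process, encode round trip described here can be sketched with the standard library alone. This is only an illustration, not the original script: the file names are placed in a temporary directory, the sample sentence is made up, and a plain `split()` stands in for the tagging step. The point it shows is that a file opened with an `encoding` argument already yields Unicode text, so no extra `decode()` or `encode()` call is needed on each line.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, 'spanish_sentences')   # demo input file
dst = os.path.join(tmp_dir, 'tagging_results')     # demo output file

# Create a small UTF-8 input file for the demo (made-up sample sentence)
with io.open(src, 'w', encoding='utf-8', newline='\n') as f:
    f.write(u'hablo a atenci\xf3n\n')

# io.open decodes on read and encodes on write
with io.open(src, encoding='utf-8') as fin, \
        io.open(dst, 'w', encoding='utf-8', newline='\n') as fout:
    for line in fin:
        tokens = line.split()                  # already Unicode text
        fout.write(u' '.join(tokens) + u'\n')  # encoded to UTF-8 by the file object

# Read the result back as raw bytes to confirm it is UTF-8 on disk
with io.open(dst, 'rb') as f:
    raw = f.read()
print(repr(raw))
```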

Any idea what I'm doing wrong?

Thanks, Alejandro

EDIT: Found the problem. The Spanish cess_esp corpus is encoded with Latin-2, so the training words are byte strings that never match the Unicode words I pass to the tagger. The code below shows how to train the tagger correctly.

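# Decode every training word from the corpus encoding to Unicode
tagged_sents = (
[(word.decode('Latin2'), tag) for (word, tag) in sent]
for sent in cess.tagged_sents()
)
tagger = ut(tagged_sents)  # train a unigram tagger (ut is UnigramTagger, as imported in the script)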
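To see why decoding the training words matters, here is a minimal sketch that uses only the standard library, not NLTK. It assumes the tagger's internal table behaves like a plain dict keyed by the training words (the warning points at `self._context_to_tag.get(context)`). The table contents and the tag given to "atención" are illustrative assumptions.

```python
# -*- coding: utf-8 -*-
# Sketch: byte-string keys do not match Unicode lookups for accented words.

# Hypothetical stand-in for the tagger's word -> tag table, keyed by raw Latin-2 bytes
raw_table = {b'atenci\xf3n': 'ncfs000', b'una': 'di0fs0'}

# What the script passes to the tagger after decoding a UTF-8 input line
word = u'atenci\xf3n'

# The lookup finds nothing: the byte-string key and the Unicode word are not equal
missing = raw_table.get(word)

# Decoding the keys first makes the lookup succeed
decoded_table = dict((key.decode('latin2'), tag) for key, tag in raw_table.items())
found = decoded_table.get(word)

print(missing, found)
```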

A better way would be to ask the CorpusReader class for the corpus encoding, so you don't need to know it beforehand.
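One way to follow that suggestion is sketched below, under stated assumptions. It relies on NLTK corpus readers exposing `fileids()` and `encoding(fileid)`. The helper names `corpus_encodings` and `decode_tagged_sents` are hypothetical, and the NLTK calls are left as comments because they need the downloaded corpus. Newer NLTK releases may already return decoded words, which the helper leaves untouched.

```python
# -*- coding: utf-8 -*-

def corpus_encodings(reader):
    """Return the set of encodings a corpus reader declares for its files."""
    return set(reader.encoding(fileid) for fileid in reader.fileids())


def decode_tagged_sents(tagged_sents, encoding):
    """Decode byte-string words to Unicode; leave already-decoded words as they are."""
    return [
        [(word.decode(encoding) if isinstance(word, bytes) else word, tag)
         for (word, tag) in sent]
        for sent in tagged_sents
    ]


# Intended use with NLTK (assumes the cess_esp corpus is installed):
#   from nltk.corpus import cess_esp as cess
#   from nltk import UnigramTagger
#   encoding = corpus_encodings(cess).pop()
#   tagger = UnigramTagger(decode_tagged_sents(cess.tagged_sents(), encoding))

# Small self-contained demo with made-up training data
sample = [[(b'atenci\xf3n', 'ncfs000'), (u'una', 'di0fs0')]]
decoded = decode_tagged_sents(sample, 'latin2')
print(decoded)
```

If a corpus declares more than one encoding across its files, `corpus_encodings` returns several values and each file would need to be decoded separately.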

Possibly something is wrong with your tagger object or with how your file is read. I rewrote part of your code and it runs without error:

# -*- coding: utf-8 -*-

import urllib2, codecs

from nltk.corpus import cess_esp as cess
from nltk import word_tokenize
from nltk import UnigramTagger as ut
from nltk import BigramTagger as bt

tagger = ut(cess.tagged_sents())

url = 'https://db.tt/42Lt5M5K'
fin = urllib2.urlopen(url).read().strip().decode('utf8')
fout = codecs.open('tagger.out', 'w', 'utf8')
for line in fin.split('\n'):
    print>>fout, "Current line contents before tagging->", line
    print>>fout, "Unigram tagger->",
    print>>fout, tagger.tag(word_tokenize(line))
    print>>fout, ""

[out]:

http://pastebin.com/n0NK574a
