
Concordance for Unicode characters in a Unicode corpus in nltk

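I have a Unicode phrase that I want to search for in my Unicode corpus using nltk. The problem is that I apparently have to convert the encoding for nltk, otherwise my concordance returns no results, and I don't know how to do that. Here is my simple code: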

import nltk
f=open('word-freq-utf8-new.txt','rU')
text=f.read()
text1=text.split()
abst=nltk.Text(text1)
abst.concordance('سلام')

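nltk does not work well with Unicode yet, although they are working on it. As a quick fix, you can subclass ConcordanceIndex and override the print_concordance method so that you encode and decode at the right points for processing and display. Here is a very quick solution in Python 2, assuming you have already imported nltk (I am using a section of a Unicode Greek text):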

>>> import re
>>> # `t` is the equivalent of your `text`. I decode it here to make sure I am working with decoded text.
>>> # If you are working with encoded text, skip this step.
>>> tokens = re.findall(ur'\w+', t.decode('utf-8'), flags=re.U)

>>> class ConcordanceIndex2(nltk.ConcordanceIndex):
    'Extends the ConcordanceIndex class.'
    def print_concordance(self, word, width=75, lines=25):
        half_width = (width - len(word) - 2) // 2
        context = width // 4 # approx number of words of context

        offsets = self.offsets(word)
        if offsets:
            lines = min(lines, len(offsets))
            print("Displaying %s of %s matches:" % (lines, len(offsets)))
            for i in offsets:
                if lines <= 0:
                    break
                left = (' ' * half_width +
                        ' '.join([x.decode('utf-8') for x in self._tokens[i-context:i]]))    # decoded here for display purposes
                right = ' '.join([x.decode('utf-8') for x in self._tokens[i+1:i+context]])    # decoded here for display purposes
                left = left[-half_width:]
                right = right[:half_width]
                print(' '.join([left, self._tokens[i].decode('utf-8'), right]))    # decoded here for display purposes
                lines -= 1
        else:
            print("No matches")
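
The encoding has to line up because a concordance lookup is an exact comparison between the query and the stored tokens. A minimal sketch of that mismatch, written in Python 3 syntax with the standard library only; the helper `find_offsets` and the three sample tokens are made up for illustration and are not part of nltk:

```python
# Sample tokens (assumed for illustration): once as UTF-8 bytes, once as text.
words = ['ΚΑΙΣΑΡΕΙΑΣ', 'ΕΚΚΛΗΣΙΑΣΤΙΚΗ', 'ΚΑΙΣΑΡΕΙΑΣ']
encoded_tokens = [w.encode('utf-8') for w in words]
decoded_tokens = [b.decode('utf-8') for b in encoded_tokens]


def find_offsets(tokens, query):
    """Return the positions at which `query` occurs in `tokens` (exact match)."""
    return [i for i, tok in enumerate(tokens) if tok == query]


# A text query never equals a bytes token, so this search comes back empty.
mismatched = find_offsets(encoded_tokens, 'ΚΑΙΣΑΡΕΙΑΣ')

# Once query and tokens have the same type, the matches appear.
as_bytes = find_offsets(encoded_tokens, 'ΚΑΙΣΑΡΕΙΑΣ'.encode('utf-8'))
as_text = find_offsets(decoded_tokens, 'ΚΑΙΣΑΡΕΙΑΣ')

print(mismatched, as_bytes, as_text)
```

Whichever side you convert, the point is to convert consistently: either encode the query to match encoded tokens, or decode the tokens to match a decoded query.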

If you are working with decoded text, you need to encode the tokens, like this:

>>> concordance_index = ConcordanceIndex2([x.encode('utf-8') for x in tokens], key=lambda s: s.lower())    # encoded here to match an encoded text
>>> concordance_index.print_concordance(u'\u039a\u0391\u0399\u03a3\u0391\u03a1\u0395\u0399\u0391\u03a3'.encode('utf-8'))
Displaying 1 of 1 matches:
                           ΚΑΙΣΑΡΕΙΑΣ ΕΚΚΛΗΣΙΑΣΤΙΚΗ ΙΣΤΟΡΙΑ Euse

Otherwise, if your tokens are already UTF-8 encoded, you can do this:

>>> concordance_index = ConcordanceIndex2(tokens, key=lambda s: s.lower())
>>> concordance_index.print_concordance('\xce\x9a\xce\x91\xce\x99\xce\xa3\xce\x91\xce\xa1\xce\x95\xce\x99\xce\x91\xce\xa3')
Displaying 1 of 1 matches:
                           ΚΑΙΣΑΡΕΙΑΣ ΕΚΚΛΗΣΙΑΣΤΙΚΗ ΙΣΤΟΡΙΑ Euse
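
In Python 3 every `str` is already Unicode, so the same idea needs no encode/decode step as long as the text is decoded once when it is read. The following is a minimal sketch under that assumption, using the standard library only. `SimpleConcordance` and its methods are hypothetical names that loosely mimic the shape of the class above; they are not nltk's API, and the sample text is just the three Greek words from the output shown earlier.

```python
import re
from collections import defaultdict


class SimpleConcordance:
    """Hypothetical stand-in for a concordance index over Unicode tokens."""

    def __init__(self, tokens, key=lambda s: s):
        self._tokens = list(tokens)
        self._key = key
        self._offsets = defaultdict(list)
        # Map each normalized token to the positions where it occurs.
        for i, tok in enumerate(self._tokens):
            self._offsets[key(tok)].append(i)

    def offsets(self, word):
        """Positions of `word`, normalized with the same key as the tokens."""
        return self._offsets.get(self._key(word), [])

    def lines(self, word, context=3):
        """One 'left [word] right' string per match, `context` tokens per side."""
        out = []
        for i in self.offsets(word):
            left = ' '.join(self._tokens[max(0, i - context):i])
            right = ' '.join(self._tokens[i + 1:i + 1 + context])
            hit = '[' + self._tokens[i] + ']'
            out.append(' '.join(part for part in (left, hit, right) if part))
        return out


# Sample text (assumed); for a real file, read it with open(path, encoding='utf-8').
text = 'ΚΑΙΣΑΡΕΙΑΣ ΕΚΚΛΗΣΙΑΣΤΙΚΗ ΙΣΤΟΡΙΑ'
tokens = re.findall(r'\w+', text)  # str patterns are Unicode-aware in Python 3
index = SimpleConcordance(tokens, key=lambda s: s.lower())

for line in index.lines('ΚΑΙΣΑΡΕΙΑΣ'):
    print(line)
```

Applying the same `key` to both the tokens and the query keeps the comparison consistent, which is the same role the encode/decode calls play in the Python 2 version.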
