Context-free grammar for Greek

I want to create a very simple context-free grammar for the Greek language using nltk. I run Python 2.7 on Windows.

Here's my code:

# -*- coding: utf-8 -*-
import nltk
grammar = nltk.CFG.fromstring("""
            S -> Verb Noun
            Verb -> a
            Noun -> b
            """)
a="κάνω"
b="ποδήλατο"

user_input = "κάνω ποδήλατο"

How can I tell whether user_input is grammatically correct? I tried:

sent =  user_input.split()
parser = nltk.ChartParser(grammar)
for tree in parser.parse(sent):
        print tree

but I get the following error, raised from line 632 of the grammar.py file that ships with nltk:

ValueError: Grammar does not cover some of the input words: u"'\\xce\\xba\\xce\\xac\\xce\\xbd\\xcf\\x89', '\\xcf\\x80\\xce\\xbf\\xce\\xb4\\xce\\xae\\xce\\xbb\\xce\\xb1\\xcf\\x84\\xce\\xbf'".

I only get the error when I run the for loop; up to that point there is no error. So I suppose it's some kind of encoding problem, which I don't know how to overcome.

Firstly, if you're using nltk.CFG.fromstring, you have to declare the terminals, i.e. the words in the lexicon, directly in the CFG grammar as quoted strings. An unquoted a or b on the right-hand side is read as a non-terminal symbol, not as a reference to the Python variables a and b, so your grammar contains no words at all:

# -*- coding: utf-8 -*-
import nltk
grammar = nltk.CFG.fromstring(u"""
            S -> Verb Noun
            Verb -> "κάνω"
            Noun -> "ποδήλατο"
            """)
parser = nltk.ChartParser(grammar)
print parser.grammar()

[out]:

Grammar with 3 productions (start state = S)
    S -> Verb Noun
    Verb -> '\u03ba\u03ac\u03bd\u03c9'
    Noun -> '\u03c0\u03bf\u03b4\u03ae\u03bb\u03b1\u03c4\u03bf'
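To see the difference between the two grammars, it may help to inspect what fromstring makes of an unquoted versus a quoted right-hand side. This is only a sketch: it assumes Python 3 and an installed nltk, and the helper name rhs_kinds is made up for illustration.

```python
# -*- coding: utf-8 -*-
import nltk
from nltk.grammar import Nonterminal


def rhs_kinds(grammar):
    """Label each right-hand-side symbol per production (hypothetical helper)."""
    kinds = {}
    for production in grammar.productions():
        labels = ["nonterminal" if isinstance(sym, Nonterminal) else "terminal"
                  for sym in production.rhs()]
        kinds[str(production.lhs())] = labels
    return kinds


# The grammar from the question: a and b are unquoted symbols.
unquoted = nltk.CFG.fromstring("""
    S -> Verb Noun
    Verb -> a
    Noun -> b
""")

# The corrected grammar: the words are quoted, so they become terminals.
quoted = nltk.CFG.fromstring("""
    S -> Verb Noun
    Verb -> "κάνω"
    Noun -> "ποδήλατο"
""")

unquoted_kinds = rhs_kinds(unquoted)
quoted_kinds = rhs_kinds(quoted)
print(unquoted_kinds["Verb"])
print(quoted_kinds["Verb"])
```

With the unquoted grammar, no production ever reaches an actual word, which is consistent with the "does not cover some of the input words" error in the question.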

Now let's look at your user_input:

>>> print ["κάνω ποδήλατο"]
['\xce\xba\xce\xac\xce\xbd\xcf\x89 \xcf\x80\xce\xbf\xce\xb4\xce\xae\xce\xbb\xce\xb1\xcf\x84\xce\xbf']

Notice that the literal is read as a byte string in Python 2.x, whereas in Python 3.x it would have been a unicode string by default. Now look at it after we decode it from UTF-8:

>>> print ["κάνω ποδήλατο".decode('utf8')]
[u'\u03ba\u03ac\u03bd\u03c9 \u03c0\u03bf\u03b4\u03ae\u03bb\u03b1\u03c4\u03bf']

Note that u"κάνω ποδήλατο" has the same effect as "κάνω ποδήλατο".decode('utf8') (given the utf-8 coding declaration at the top of the file): both give you a unicode string, and the u prefix is the simpler choice when you are hardcoding a value.

Now the input looks the same as the grammar read by nltk.CFG.fromstring():
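The mismatch between bytes and text can be reproduced without nltk. The sketch below assumes Python 3, where an unprefixed literal is already text, so the Python 2 byte string is simulated with encode; the lexicon set is a stand-in for the grammar's terminals, not an nltk object.

```python
# -*- coding: utf-8 -*-
text = "κάνω ποδήλατο"        # text (unicode) string in Python 3
raw = text.encode("utf-8")     # the bytes Python 2 holds for an unprefixed literal

byte_tokens = raw.split()                    # tokens are byte strings
text_tokens = raw.decode("utf-8").split()    # tokens are text strings

# Stand-in for the unicode terminals of the grammar (assumption for illustration).
lexicon = {"κάνω", "ποδήλατο"}

covered_raw = all(token in lexicon for token in byte_tokens)
covered_text = all(token in lexicon for token in text_tokens)

print(len(text), len(raw))
print(covered_raw, covered_text)
```

Only the decoded tokens match the lexicon, which mirrors why the byte-string input was reported as not covered by the grammar.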

# -*- coding: utf-8 -*-

import nltk
grammar = nltk.CFG.fromstring(u"""
            S -> Verb Noun
            Verb -> "κάνω"
            Noun -> "ποδήλατο"
            """)
parser = nltk.ChartParser(grammar)

# Unicode tokens, matching the unicode terminals in the grammar
sent = u"κάνω ποδήλατο".split()

for tree in parser.parse(sent):
    print tree

[out]:

(S (Verb \u03ba\u03b1\u03bd\u03c9) (Noun \u03c0\u03bf\u03b4\u03b7\u03bb\u03b1\u03c4\u03bf))
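Since the question asks for a yes/no answer on whether the input is grammatical, one possible approach is to wrap the parser in a small predicate. This is a sketch under assumptions: it targets Python 3 with nltk installed, and is_grammatical is a hypothetical helper name, not part of nltk.

```python
# -*- coding: utf-8 -*-
import nltk

grammar = nltk.CFG.fromstring("""
    S -> Verb Noun
    Verb -> "κάνω"
    Noun -> "ποδήλατο"
""")
parser = nltk.ChartParser(grammar)


def is_grammatical(sentence):
    """Return True if the grammar yields at least one parse (hypothetical helper)."""
    tokens = sentence.split()
    try:
        # A ValueError is raised when a token is not covered by the grammar.
        return any(True for _ in parser.parse(tokens))
    except ValueError:
        return False


print(is_grammatical("κάνω ποδήλατο"))     # words in the expected order
print(is_grammatical("ποδήλατο κάνω"))     # known words, wrong order
print(is_grammatical("κάνω αυτοκίνητο"))   # second word is not in the lexicon
```

Treating an uncovered word as "not grammatical" is a design choice made here for brevity; you may prefer to let the ValueError propagate so that unknown words are reported separately from bad word order.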

You may notice something odd about the output, though: it shows not the Unicode characters themselves but their escaped representation (\uXXXX sequences):

>>> x = '\u03ba\u03b1\u03bd\u03c9'
>>> print x
\u03ba\u03b1\u03bd\u03c9
>>> print x.decode('utf8')
\u03ba\u03b1\u03bd\u03c9
>>> print x.encode('utf8')
\u03ba\u03b1\u03bd\u03c9
>>> x = u'\u03ba\u03b1\u03bd\u03c9'
>>> print x
κανω

You would need to do this to retrieve your original unicode (thanks to @Kasra, see How to retrieve my unicode from the unicode byte representation):

>>> s='\u03ba\u03b1\u03bd\u03c9'
>>> print unicode(s,'unicode_escape')
κανω
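For readers on Python 3, where the unicode() builtin no longer exists, a roughly equivalent conversion can be sketched as follows. It assumes the escaped string contains only ASCII characters, as in the example above.

```python
# A str holding literal backslash-u sequences (24 ASCII characters), not Greek letters.
escaped = r"\u03ba\u03b1\u03bd\u03c9"

# Encode to bytes first, then let the unicode_escape codec interpret the escapes.
restored = escaped.encode("ascii").decode("unicode_escape")

print(len(escaped), len(restored))
print(restored)
```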
