NLP nltk using the custom grammar

Hi, let's imagine I have a grammar like this: S -> NNP VBZ NNP. However, the number of NNPs is huge and they are stored in a file. How can I load that directly into the grammar, or how can I make sure that the grammar fetches the words from the corpus instead of me specifying all the words?

Assuming each POS has its own text file, with every possible word for that tag on a separate line, you just want to build a dictionary by reading in the lines:

lexicon = {}
with open('path/to/the/files/NNP.txt', 'r') as NNP_File:
    # 'with' automatically closes the file once you're done.
    # Strip the trailing newline from each line before adding it.
    # A set seems like a good idea here, but it depends on your purposes.
    lexicon['NNP'] = set(line.strip() for line in NNP_File)
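
For example, a membership test against that set tells you whether a word can be an NNP; in this sketch, 'John' is only a placeholder for something that actually appears in your NNP.txt:

# hypothetical lookup; 'John' stands in for any word from NNP.txt
if 'John' in lexicon['NNP']:
    print("'John' can be an NNP")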

This setup is good for checking whether a given word can be a specified part of speech; you could also flip it around and make the words the keys, if that works better for what you're building:

with open('path/to/the/files/NNP.txt', 'r') as NNP_File:
    for line in NNP_File:
        word = line.strip()
        # dict.has_key() is Python 2 only; use the 'in' operator instead.
        if word in lexicon:
            lexicon[word].update(['NNP'])
        else:
            lexicon[word] = set(['NNP'])
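
Looking a word up then returns the set of tags it can take; again, 'John' is only a placeholder:

# hypothetical lookup; unknown words come back as an empty set
print(lexicon.get('John', set()))   # e.g. {'NNP'}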

If your text files are formatted a different way, you'll need to take a different approach.

EDIT: To produce a grammar line in the format you mentioned, you could follow the first approach above with something like:

with open('path/NNP.txt', 'r') as f:
    # str.join() takes an iterable; strip the newlines so they don't
    # end up inside the rule.
    NNP_terminal_rule = 'NNP -> ' + ' | '.join(line.strip() for line in f)
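
If the goal is to feed that rule into NLTK's own CFG machinery, note that nltk.CFG.fromstring expects terminal symbols to be quoted. A minimal sketch under that assumption (the file path, the 'likes' terminal, and the sample sentence are all placeholders, not part of the original answer):

import nltk

# Build the NNP rule from the file, quoting each word as a terminal.
with open('path/NNP.txt', 'r') as f:
    nnp_words = [line.strip() for line in f if line.strip()]

rules = [
    "S -> NNP VBZ NNP",
    "NNP -> " + " | ".join("'{}'".format(w) for w in nnp_words),
    "VBZ -> 'likes'",  # placeholder; build this from a VBZ file the same way
]
grammar = nltk.CFG.fromstring("\n".join(rules))

# Parsing only works if every token in the sentence is covered by the grammar.
parser = nltk.ChartParser(grammar)
for tree in parser.parse(['John', 'likes', 'Mary']):  # assumes 'John' and 'Mary' are in NNP.txt
    print(tree)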
