
Python NLTK Tagging AssertionError

I'm running into an odd assertion error when using NLTK to process around 5000 posts with the PlaintextCorpusReader. With most of our datasets we don't have any major issues. However, on rare occasions I'm met with:

File "/home/cp-staging/environs/cpstaging/lib/python2.5/site-packages/nltk/tag/api.py", line 51, in batch_tag
return [self.tag(sent) for sent in sentences]
File "nltk/corpus/reader/util.py", line 401, in iterate_from
File "nltk/corpus/reader/util.py", line 343, in iterate_from
AssertionError

My code works (basically) like so:

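import nltk
from nltk.corpus import brown
from nltk.corpus.reader import PlaintextCorpusReader

brown_tagged_sents = brown.tagged_sents()
tag0 = ArcBaseTagger('NN')  # ArcBaseTagger is defined elsewhere (not an NLTK class)
tag1 = nltk.UnigramTagger(brown_tagged_sents, backoff=tag0)
posts = PlaintextCorpusReader(posts_path, '.*')  # posts_path: directory holding the posts
tagger = nltk.BigramTagger(brown_tagged_sents, backoff=tag1)
tagged_sents = tagger.batch_tag(posts.sents())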

It seems like nltk is losing its place in the file buffer, but I'm not 100% on that. Any idea what might cause this to happen? It almost seems like it has to have something to do with the data I'm processing. Maybe some funky characters?
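One way to test the "it's the data" hunch is to read each file on its own and note which ones raise. The sketch below is only a diagnostic aid: `find_failing_fileids` is a hypothetical helper name, and it assumes nothing about the reader beyond the `fileids()` and `sents(fileid)` methods that `PlaintextCorpusReader` provides.

```python
def find_failing_fileids(reader):
    """Return the fileids whose sentences cannot be read without an AssertionError.

    `reader` is assumed to expose fileids() and sents(fileid), as
    PlaintextCorpusReader does.
    """
    failures = []
    for fileid in reader.fileids():
        try:
            # Corpus views are lazy, so iterate to force the whole file to be read.
            for _ in reader.sents(fileid):
                pass
        except AssertionError:
            failures.append(fileid)
    return failures
```

Calling `find_failing_fileids(posts)` on the reader from the snippet above should narrow the problem down to specific files, which can then be inspected by hand.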

I also faced this problem when one of my write functions was leaving my corpus files empty. Making sure the files you are reading are not empty avoids this error.

I removed some empty files from the set being parsed, and the problem was solved.
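Following the two answers above, a minimal sketch of skipping empty files before the reader ever sees them. The helper name `non_empty_fileids` is hypothetical, and treating whitespace-only files as empty is an assumption on top of what the answers report (they only mention empty files).

```python
import os


def non_empty_fileids(root):
    """Return root-relative paths of files that contain some non-whitespace content."""
    fileids = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            # Zero-byte files are the case reported in the answers.
            if os.path.getsize(full_path) == 0:
                continue
            # Assumption: files holding only whitespace are skipped as well.
            with open(full_path, 'rb') as handle:
                if not handle.read().strip():
                    continue
            # Corpus fileids use forward slashes regardless of platform.
            relative = os.path.relpath(full_path, root).replace(os.sep, '/')
            fileids.append(relative)
    return sorted(fileids)
```

The resulting list can be passed in place of the `'.*'` pattern, for example `PlaintextCorpusReader(posts_path, non_empty_fileids(posts_path))`, since the reader accepts either a regular expression or an explicit list of fileids.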
