
Python Package with NLTK as a Dependency

I've looked around for a question pertaining to this without any hits, so here we go:

I am working on a toy Python package to deploy on PyPI.org. Part of its job involves streamlining the process of parsing text and generating tokenized sentences. Naturally, I have considered using nltk for the job, having personally used tools like punkt from the package.

Here's the problem and my question: Having looked at the size of nltk and the requirements for it to work, with the corpora nearly 10 gigabytes in size, I've come to the conclusion that this is an outlandish burden to put on anyone who wants to use my package, given its use case.

Is there any way to deploy a "pre-trained" instance of punkt? Or can I control the size of the corpora used by nltk?

I am equally open to an alternative package/solution for parsing relatively "sane" human text that comes somewhat close to the performance of nltk but without the same disk footprint.

Thanks for any help.


Edit: the solution, as indicated below by @matisetorm, was for me:

python -m nltk.downloader punkt
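
For reference, once punkt has been fetched this way, sentence tokenization works without pulling in any of the other corpora; a minimal sketch (the sample text is made up for illustration):

# Assumes `python -m nltk.downloader punkt` has already been run.
from nltk.tokenize import sent_tokenize

text = "NLTK ships many resources. Only the punkt model is needed here."
print(sent_tokenize(text))
# -> ['NLTK ships many resources.', 'Only the punkt model is needed here.']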

Absolutely.

1) You can selectively download corpora, as described in Programmatically install NLTK corpora / models, i.e. without the GUI downloader (a Python-level variant is sketched after option 2 below). For example,

python -m nltk.downloader <your package you would like to download>

2) Or use the GUI downloader, following the instructions at http://www.nltk.org/data.html

This basically amounts to running the following at the command line:

python3
import nltk
nltk.download()
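
If the download needs to happen from inside Python rather than the shell, for example lazily when your package is first used, here is a hedged sketch; the nltk.data.find guard against re-downloading is an illustrative choice, not part of the answer above:

import nltk

try:
    # Skip the download if the punkt model is already on the NLTK data path.
    nltk.data.find("tokenizers/punkt")
except LookupError:
    # Non-GUI, selective download of just the punkt model.
    nltk.download("punkt", quiet=True)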
