
Python Package with NLTK as a Dependency

I've looked around for a question pertaining to this without any hits, so here we go:

I am working on a toy Python package to deploy on PyPI.org. Part of its job involves streamlining the process of parsing text and generating tokenized sentences. Naturally, I have considered using nltk for the job, having personally used tools like punkt from the package.

Here's the problem and my question: Having looked at the size of nltk and the requirements for it to work, with the corpora nearly 10 gigabytes in size, I've come to the conclusion that this is an outlandish burden to put on anyone who wants to use my package, given its use case.

Is there any way to deploy a "pre-trained" instance of punkt? Or can I control the size of the corpora used by nltk?

I am equally open to an alternative package/solution for parsing relatively "sane" human text that comes somewhat close to the performance of nltk but without the same disk footprint.

Thanks for any help.


Edit: the solution, as indicated below by @matisetorm, was for me:

python -m nltk.downloader punkt
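
For reference, once punkt has been fetched this way, sentence tokenization works without pulling in any of the other corpora; a minimal sketch (the sample text is made up for illustration):

# Assumes `python -m nltk.downloader punkt` has already been run.
from nltk.tokenize import sent_tokenize

text = "NLTK ships many resources. Only the punkt model is needed here."
print(sent_tokenize(text))
# -> ['NLTK ships many resources.', 'Only the punkt model is needed here.']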

Absolutely.

1) You can selectively download corpora, as described in Programmatically install NLTK corpora / models, i.e. without the GUI downloader (a Python-level variant is sketched after option 2 below). For example,

python -m nltk.downloader <your package you would like to download>

2) Or use the GUI downloader, following the instructions at http://www.nltk.org/data.html

This basically amounts to running the following at the command line:

python3
import nltk
nltk.download()
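
If the download needs to happen from inside Python rather than the shell, for example lazily when your package is first used, here is a hedged sketch; the nltk.data.find guard against re-downloading is an illustrative choice, not part of the answer above:

import nltk

try:
    # Skip the download if the punkt model is already on the NLTK data path.
    nltk.data.find("tokenizers/punkt")
except LookupError:
    # Non-GUI, selective download of just the punkt model.
    nltk.download("punkt", quiet=True)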
