Using NLTK corpora with AWS Lambda functions in Python

Question

I'm having trouble using NLTK corpora (in particular the stop words) in AWS Lambda. I'm aware that the corpora need to be downloaded; I have done so with nltk.download('stopwords') and included them in the zip file used to upload the Lambda modules, under nltk_data/corpora/stopwords.

The usage in the code is as follows:

from nltk.corpus import stopwords
stopwords = stopwords.words('english')
nltk.data.path.append("/nltk_data")

This returns the following error in the Lambda log output:

module initialization error: 
**********************************************************************
  Resource u'corpora/stopwords' not found.  Please use the NLTK
  Downloader to obtain the resource:  >>> nltk.download()
  Searched in:
    - '/home/sbx_user1062/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/nltk_data'
**********************************************************************

I have also tried to load the data directly by including:

nltk.data.load("/nltk_data/corpora/stopwords/english")

This yields a different error:

module initialization error: Could not determine format for file:///stopwords/english based on its file
extension; use the "format" argument to specify the format explicitly.

It's possible that it has a problem loading the data from the Lambda zip and needs it stored externally, say on S3, but that seems a bit strange.

Any idea what format the

Does anyone know where I could be going wrong?

Answer 1

I had the same problem, and solved it with an environment variable.

Run nltk.download() and copy the downloaded data to the root folder of your AWS Lambda application. (The folder should be called "nltk_data".)

You can use the following code for that:

import nltk
nltk.download('punkt', download_dir='nltk_data/')

This will download 'punkt' into nltk_data/ in your project root. Then add the line below to your Dockerfile:

COPY nltk_data ./nltk_data

In the user interface of your Lambda function (in the AWS console), add the environment variable "NLTK_DATA" = "./nltk_data".
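The same variable can also be set from code, as long as that happens before nltk is first imported, because NLTK builds its search path from NLTK_DATA when its data module is loaded. The following is only a sketch: it assumes the bundle is unpacked under the directory named by Lambda's LAMBDA_TASK_ROOT variable (falling back to /var/task), and bundled_nltk_data_dir is a hypothetical helper name, not part of NLTK or AWS.

```python
import os
import posixpath


def bundled_nltk_data_dir(env=None, folder="nltk_data"):
    """Return the absolute path of the nltk_data folder shipped with the function.

    Assumption: the deployment package is unpacked under LAMBDA_TASK_ROOT;
    /var/task is used as a fallback when the variable is absent.
    """
    env = os.environ if env is None else env
    task_root = env.get("LAMBDA_TASK_ROOT", "/var/task")
    return posixpath.join(task_root, folder)


# Set the variable before the first `import nltk`; a value already
# configured in the console is left untouched by setdefault.
os.environ.setdefault("NLTK_DATA", bundled_nltk_data_dir())
```

Using an absolute path avoids depending on the working directory of the process, which the relative "./nltk_data" value does.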

Answer 2

Another solution is to use Lambda's ephemeral storage, located at /tmp.

So, you would have something like this:

import nltk
import json
from nltk.tokenize import word_tokenize

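# Let NLTK search /tmp, the only writable location in Lambda.
nltk.data.path.append("/tmp")

# Fetch the tokenizer data into /tmp when the module is initialised.
nltk.download("punkt", download_dir="/tmp")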

At runtime, punkt is downloaded to the /tmp directory, which is writable. However, this likely isn't a great solution under high concurrency: /tmp is not shared between execution environments, so every cold start repeats the download.
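One way to soften that cost is to skip the download when an earlier invocation in the same execution environment has already left the data in /tmp. This is a minimal sketch, not part of the original answer: ensure_nltk_resource and its parameters are hypothetical names, and the downloader argument exists only so the logic can be exercised without network access.

```python
import os


def ensure_nltk_resource(resource_path, package, download_dir="/tmp", downloader=None):
    """Download `package` into `download_dir` unless `resource_path` already exists there.

    Returns True when a download was triggered, and False when a copy
    left by an earlier invocation was found and reused.
    """
    target = os.path.join(download_dir, resource_path)
    if os.path.exists(target):
        return False
    if downloader is None:
        # Imported lazily so the function can be tested without nltk installed.
        import nltk
        downloader = nltk.download
    downloader(package, download_dir=download_dir)
    return True
```

In a handler module this would be called as ensure_nltk_resource("tokenizers/punkt", "punkt") after nltk.data.path.append("/tmp"). It only helps warm starts; a fresh execution environment still has to download.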

Answer 3

On AWS Lambda you need to include the nltk Python package in your deployment package and change the following in its data.py:

path += [
    str('/usr/share/nltk_data'),
    str('/usr/local/share/nltk_data'),
    str('/usr/lib/nltk_data'),
    str('/usr/local/lib/nltk_data')
]

to

path += [
    str('/var/task/nltk_data')
    #str('/usr/share/nltk_data'),
    #str('/usr/local/share/nltk_data'),
    #str('/usr/lib/nltk_data'),
    #str('/usr/local/lib/nltk_data')
]

You can't include the entire nltk_data directory. Delete all the zip files and keep only what you need: if you only need stopwords, keep nltk_data -> corpora -> stopwords and discard the rest; if you need tokenizers, keep nltk_data -> tokenizers -> punkt. To download the nltk_data folder, use an Anaconda Jupyter notebook and run

nltk.download()

or download the stopwords archive directly from

https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/stopwords.zip

or run from a shell

python -m nltk.downloader all
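The trimming step described above can be scripted. The sketch below copies only the unpacked packages you name into a fresh folder instead of deleting files in place; copy_needed_packages and the folder names are hypothetical, and it assumes the packages have already been downloaded and unzipped under the source folder.

```python
import os
import shutil


def copy_needed_packages(src_root, dst_root, keep=("corpora/stopwords", "tokenizers/punkt")):
    """Copy the unpacked package folders listed in `keep` from src_root to dst_root.

    The zip archives sit next to the unpacked folders rather than inside
    them, so they are left behind. Returns the destination paths created.
    """
    copied = []
    for rel in keep:
        src = os.path.join(src_root, *rel.split("/"))
        dst = os.path.join(dst_root, *rel.split("/"))
        # copytree creates the missing parent folders of dst as well.
        shutil.copytree(src, dst)
        copied.append(dst)
    return copied
```

For example, copy_needed_packages("full_nltk_data", "nltk_data", keep=("corpora/stopwords",)) would produce an nltk_data folder containing only the stopwords corpus, ready to be placed in the deployment package.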

Answer 4

If your stopwords corpus is under /nltk_data (at the filesystem root, not under your home directory), you need to tell NLTK where it is before you try to access a corpus:

import nltk
from nltk.corpus import stopwords

# Register the data directory before the corpus is first accessed.
nltk.data.path.append("/nltk_data")

stopwords = stopwords.words('english')

4 answers

solution1
16 2017-08-28 21:35:28

solution2
13 2019-03-20 01:48:57

solution3
2 2017-07-12 23:47:05

solution4
1 2017-02-22 11:23:21
