NLTK corpora download is hanging when run in AWS Lambda Python function

I'm trying to download NLTK data onto the file storage of a Lambda function like so:

import nltk

# /tmp is the writable directory inside the Lambda execution environment
nltk.data.path.append("/tmp")
nltk.download("popular", download_dir="/tmp")

The Lambda function keeps timing out. When I check the CloudWatch logs, I see no entries related to the download of the individual corpora files (e.g. "Downloading package cmudict to /tmp..."); instead, the code seems to reach nltk.download() and then hang forever.

Has anyone seen this strange behavior?

Several limitations (or rather design concepts) of Lambda collide with what you're trying to do here:

  • Lambdas are intended to execute rather simple functions. Accordingly, they come with a timeout that is quite short by default (3 seconds). Running a download of any size during execution is likely to hit it. You can of course extend the timeout (up to 15 minutes), but you will run into other issues (see below).
  • Lambdas are short-lived. An idle execution environment is cleaned out after a period of inactivity (AWS does not guarantee how long) and has to be re-created on the next call (what AWS calls a "cold start"). So even if you manage to download the NLTK data inside the Lambda, it won't be kept and will need to be fetched again after each cold start.
  • You could try to bundle the NLTK data with the function manually, but a zipped deployment package uploaded directly is limited to 50 MB (250 MB unzipped, including layers), which makes it hard to ship a large data set with the function.
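The cold-start point above suggests downloading only when the data is not already present in /tmp, so that warm invocations skip the network entirely. The following is a minimal sketch of that idea, not a definitive implementation. The names `ensure_packages` and `DATA_DIR` and the marker-file scheme are assumptions made for illustration; only `nltk.download(..., download_dir=..., quiet=...)` and `nltk.data.path` come from NLTK itself.

```python
import os

DATA_DIR = "/tmp/nltk_data"  # assumption: any writable path under /tmp would do


def ensure_packages(packages, data_dir=DATA_DIR, download=None):
    """Fetch each package at most once per execution environment.

    Returns the list of packages that were actually downloaded on this call.
    """
    if download is None:
        import nltk  # imported lazily so the helper can be exercised without NLTK

        if data_dir not in nltk.data.path:
            nltk.data.path.append(data_dir)
        download = nltk.download

    os.makedirs(data_dir, exist_ok=True)
    fetched = []
    for name in packages:
        # Hypothetical marker file recording that this package was fetched earlier
        marker = os.path.join(data_dir, f".{name}.done")
        if os.path.exists(marker):
            continue  # warm start: files from an earlier invocation are still in /tmp
        if not download(name, download_dir=data_dir, quiet=True):
            raise RuntimeError(f"NLTK download failed for {name!r}")
        open(marker, "w").close()
        fetched.append(name)
    return fetched
```

Calling `ensure_packages(["cmudict"])` at the top of the handler would then cost a download only on a cold start. It does not help if the function has no network route to the download server in the first place.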

If you need data to be available to your Lambda function, the easiest approach is probably to store it in an S3 bucket. You can find a detailed example of how to do that here (credits to Alexey Smirnov).
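As a rough illustration of the S3 approach (not the linked example itself), the sketch below assumes the NLTK data has been zipped and uploaded to a bucket beforehand. The function name `fetch_nltk_data`, the bucket name, and the object key are hypothetical; `download_file(bucket, key, filename)` is the boto3 S3 client method.

```python
import os
import zipfile


def fetch_nltk_data(s3_client, bucket, key, data_dir="/tmp/nltk_data"):
    """Download a zip of NLTK data from S3 and unpack it into data_dir.

    Returns True if the archive was fetched, False if data_dir was already populated.
    """
    if os.path.isdir(data_dir) and os.listdir(data_dir):
        return False  # already unpacked by an earlier invocation in this environment

    os.makedirs(data_dir, exist_ok=True)
    archive = data_dir + ".zip"
    s3_client.download_file(bucket, key, archive)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(data_dir)
    os.remove(archive)  # free /tmp space once the data is unpacked
    return True


def handler(event, context):
    # Imported here only to keep the helper above free of third-party imports
    import boto3
    import nltk

    # Hypothetical bucket and key; replace with your own
    fetch_nltk_data(boto3.client("s3"), "my-nltk-bucket", "nltk_data.zip")
    if "/tmp/nltk_data" not in nltk.data.path:
        nltk.data.path.append("/tmp/nltk_data")
    return {"status": "ok"}
```

The function's execution role needs read access to the object. If the function runs inside a VPC, it also needs a network route to S3.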

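Got it: my Lambda function was running in a VPC. A function attached to a VPC has no outbound access unless the VPC provides a route, which is why the call hung without logging anything. I had to add an endpoint to enable the VPC to access S3.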
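A silent hang like this can be surfaced earlier by probing connectivity with an explicit timeout before starting a download. The sketch below is one possible check, built only on the standard library. The helper name `can_reach` and the host names in the usage note are assumptions, not something taken from the original setup.

```python
import socket


def can_reach(host, port=443, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and DNS resolution failures
        return False
```

Logging, for example, `can_reach("s3.amazonaws.com")` and `can_reach("raw.githubusercontent.com")` at the start of the handler would show in CloudWatch whether the function has a route out of the VPC, instead of letting the invocation run into the Lambda timeout.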

