
Running a Python script against a bunch of S3 files

I have a Python script that I want to run on S3 files and send the output to another S3 bucket.

Now I could spin up an EC2 instance and drive it with boto, and that's fine. But that approach doesn't seem to offer an automatic way to shut the instance down once the processing is complete (I'm going to be operating on about 100 GB of data, so I don't want to sit there and watch it).
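For reference, a rough boto3 sketch of that EC2 approach would look something like the following (the AMI ID, instance type, bucket, and instance profile names are placeholders). The instance can be told to terminate itself when its user-data script finishes, but this is exactly the kind of plumbing I'd rather not maintain myself:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# User data runs once at boot. Ending it with "shutdown -h now", combined with
# InstanceInitiatedShutdownBehavior="terminate", makes the instance clean itself up.
user_data = """#!/bin/bash
aws s3 cp s3://my-input-bucket/script.py /tmp/script.py
python /tmp/script.py
shutdown -h now
"""

ec2.run_instances(
    ImageId="ami-0123456789abcdef0",    # placeholder AMI
    InstanceType="m4.large",            # placeholder instance type
    MinCount=1,
    MaxCount=1,
    UserData=user_data,
    InstanceInitiatedShutdownBehavior="terminate",
    IamInstanceProfile={"Name": "s3-access-profile"},  # placeholder role with S3 access
)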

AWS Data Pipeline looks attractive in that it scales appropriately and releases resources when done, which is great. But I can't seem to find a way to run a Python script in a pipeline. The ShellCommandActivity seems closest, but I'm not able to figure out how to set it up so that the proper virtual environment (with the appropriate packages, etc.) gets built. What is the best way to achieve this? Any help would be greatly appreciated.

The resources Data Pipeline brings up have Python installed on them. You can just use a ShellCommandActivity and run Python. Here is an example pipeline running a ShellCommandActivity: https://github.com/awslabs/data-pipeline-samples/tree/master/samples/helloworld

You can substitute the script with something like:

python -c 'print("Hi")'

Or, if you have your Python scripts on S3, you can download and run them:

wget https://s3.bucket.url/script.py
python script.py
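
If your script needs packages beyond the standard library, you can have the ShellCommandActivity build a virtual environment first. A rough sketch (the bucket names and requirements file are placeholders, and it assumes the AWS CLI, pip, and virtualenv are available or installable on the resource):

# Stage the script and its dependency list from S3
# (aws s3 cp also works for private objects that wget cannot fetch)
aws s3 cp s3://my-bucket/script.py .
aws s3 cp s3://my-bucket/requirements.txt .
# Build an isolated environment and install the packages into it
virtualenv venv
. venv/bin/activate
pip install -r requirements.txt
# Run the script and push its output to the destination bucket
python script.py > output.txt
aws s3 cp output.txt s3://my-output-bucket/output.txt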
