
Architecture on AWS: Running a distributed algorithm on dynamic nodes

[architecture diagram]

As shown in the diagram, the pet project I am working on has the following two components.

a) The "RestAPI layer" (set of micro-services)

b) "Scalable Parallelized Algorithm" component.

I am planning to run this on AWS. I realized that I can use Elastic Beanstalk to deploy my RestAPI module (a Spring Boot JAR with embedded Tomcat).

I am thinking about how to architect the "Scalable Parallelized Algorithm" component. Here are some design details:

  • This consists of a couple of Nodes which all share the same data stored on S3.
  • Each node performs the "algorithm" on a chunk of the S3 data. One node works as the master node, and the rest of the nodes send their partial results to it (embarrassingly parallel, master-slave paradigm). The master node is invoked by the RestAPI layer.
  • A "Node" is a Spring Boot application which communicates with the other nodes over HTTP.
  • The number of "Nodes" is dynamic, which means I should be able to manually add a new Node as the size of the S3 data grows.
  • There is a "Node Registry" on Redis which contains the IPs of all the nodes. Each node registers itself and uses the list of IPs in the registry to communicate with the others (see the sketch after this list).

My questions:

1) Shall I use EC2 to deploy the "Nodes", or can I use Elastic Beanstalk to deploy these nodes as well? I know that with EC2 I can manage the number of nodes depending on the size of the S3 data, but is it possible to do this with Elastic Beanstalk?

2) Can I use

Inet4Address.getLocalHost().getHostAddress() 

to get the IP of each Node? Do EC2 instances have more than one IP? This IP should allow the RestAPI layer to communicate with the "master" Node.

3) What is the component I should use to expose my RestAPI layer to the external world? I don't want to expose my "Nodes".

Update: I can't use MapReduce since the nodes have state. That is, during initialization, each Node reads its chunk of data from S3 and creates the "vector space" in memory. This is a time-consuming process, which is why the result has to be kept in memory. The system also needs near-real-time responses, so it cannot use a "batch" system like MR.
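
For context, the per-node initialization is roughly the following (a simplified sketch assuming the AWS SDK for Java; VectorSpace stands in for the actual in-memory structure):

import java.io.InputStream;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.S3Object;

// Simplified sketch of a node's startup: read its assigned S3 chunk once
// and build the in-memory "vector space". VectorSpace is a placeholder
// for the actual data structure.
public class NodeInitializer {

    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    private VectorSpace vectorSpace;

    public void init(String bucket, String chunkKey) throws Exception {
        S3Object object = s3.getObject(bucket, chunkKey);
        try (InputStream in = object.getObjectContent()) {
            vectorSpace = VectorSpace.buildFrom(in); // expensive, done only once
        }
    }
}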

1) I would look into CloudFormation to help you automate and orchestrate the Scalable Parallelized Algorithm. Read this FAQ:

https://aws.amazon.com/cloudformation/faqs/

2) With regard to question #2, EC2 instances can have both a private and a public IP, depending on how you configure them. You can query the EC2 instance metadata service from the instance to obtain this information, like this:

curl http://169.254.169.254/latest/meta-data/public-ipv4

or

curl http://169.254.169.254/latest/meta-data/local-ipv4

Full reference to EC2 instance metadata:

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-metadata.html
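
If you would rather do the same lookup from inside the Spring Boot node instead of shelling out to curl, a minimal sketch with the plain JDK (no extra dependencies; error handling omitted):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Reads one EC2 instance metadata value, e.g. "local-ipv4" or "public-ipv4".
// This endpoint only resolves from within an EC2 instance.
public final class Ec2Metadata {

    public static String fetch(String path) throws Exception {
        URL url = new URL("http://169.254.169.254/latest/meta-data/" + path);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(2000);
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            return in.readLine();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(fetch("local-ipv4")); // the node's private IP
    }
}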

3) Check out the API Gateway service; it might be what you are looking for:

https://aws.amazon.com/api-gateway/faqs/

Some general principles

  • Use infrastructure automation: CloudFormation, or Troposphere on top of CloudFormation. This will keep your system clean and easy to maintain.
  • Use tagging: it keeps your AWS account nice and tidy. You can also write handy scripts, such as describing all instances based on tags, which can be a one-liner CLI/SDK call returning all the IPs of your "slave" instances.
  • Use more tags; they can be really powerful.

Elastic Beanstalk vs. "manual" setup

Elastic Beanstalk sounds like a good choice to me, but it's important to see that it uses the same components I would recommend anyway:

  • Create an AMI which contains your Slave Instance ready to go, or
  • Create an AMI and use UserData to configure your Slave, or
  • Create an AMI and/or use an orchestration tool like Chef or Puppet to configure your slave instance.
  • Use this AMI in an Auto Scaling launch configuration.
  • Create an Auto Scaling group which can run a fixed number of instances or scale based on a metric.
  • Pro setup: if you can somehow count the jobs waiting for execution, that count can be a metric for scaling up or down automatically.
  • Pro+ tip: use the Master node to create the jobs and put them into an SQS queue (see the sketch after this list). The length of the queue is a good metric for scaling. Failed jobs go back to the queue and are re-executed. (The SQS message contains only a reference to the job, not its full data.)
  • Using a queue decouples your environment, which is highly recommended.
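
A minimal sketch of the master-side enqueueing (assuming the AWS SDK for Java; the queue URL and key naming are illustrative):

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;

// Master-side sketch: enqueue a *reference* to the work (here, an S3 key),
// never the job's full data.
public class JobQueue {

    private final AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
    private final String queueUrl;

    public JobQueue(String queueUrl) {
        this.queueUrl = queueUrl;
    }

    public void submit(String s3ChunkKey) {
        sqs.sendMessage(queueUrl, s3ChunkKey);
    }
}

The queue length (the ApproximateNumberOfMessagesVisible CloudWatch metric) can then drive the Auto Scaling policy mentioned above.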

To be clear, Elastic Beanstalk does something similar. In fact, if you create a multi-node Beanstalk stack, it runs a CloudFormation template and creates an ELB, an ASG, a launch configuration, and the instances. You just have a bit less control, but also less management overhead.

If you go with Beanstalk, you need a Worker Environment, which also creates the SQS queue for you. If you go for a Worker Environment, you can find tutorials and working examples, which will make your start easier.
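
On the worker side, the Beanstalk daemon POSTs each SQS message body to an HTTP endpoint on the instance; a sketch of what that endpoint could look like in Spring Boot (the path is configurable in the environment settings, and runAlgorithmOn is a placeholder):

import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

// In a Worker Environment, the SQS daemon POSTs each message body to a
// configurable HTTP path on the instance. Returning 200 deletes the
// message; any other status sends it back to the queue for retry.
@RestController
public class WorkerController {

    @PostMapping("/process")
    public ResponseEntity<Void> handle(@RequestBody String s3ChunkKey) {
        runAlgorithmOn(s3ChunkKey); // placeholder for the actual computation
        return ResponseEntity.ok().build();
    }

    private void runAlgorithmOn(String key) {
        // ... run the algorithm on the referenced S3 chunk ...
    }
}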

Further reading: Background Task Handling for AWS Elastic Beanstalk; Architectural Overview.

2) You can use the CLI, which has some filtering capabilities, or you can use other tools like jq for filtering/formatting the output. Here is a similar example. Note: use tags, and then you can easily filter the instances. Or you can query based on the ELB/ASG.
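
The same tag-based lookup via the SDK for Java could look like this (a sketch; the Role=slave tag is an assumption, use whatever tagging scheme you pick):

import java.util.ArrayList;
import java.util.List;
import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.DescribeInstancesRequest;
import com.amazonaws.services.ec2.model.Filter;
import com.amazonaws.services.ec2.model.Instance;
import com.amazonaws.services.ec2.model.Reservation;

// Sketch: list the private IPs of all running instances tagged Role=slave.
// The tag key/value is an assumption, not a convention of any AWS service.
public class SlaveDiscovery {

    public static List<String> slaveIps() {
        AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();
        DescribeInstancesRequest request = new DescribeInstancesRequest()
                .withFilters(
                        new Filter("tag:Role").withValues("slave"),
                        new Filter("instance-state-name").withValues("running"));
        List<String> ips = new ArrayList<>();
        for (Reservation r : ec2.describeInstances(request).getReservations()) {
            for (Instance i : r.getInstances()) {
                ips.add(i.getPrivateIpAddress());
            }
        }
        return ips;
    }
}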

3) Exposing your API via the API Gateway sounds like a good solution. I assume you want to expose only the Master node(s), since that is what manages the tasks.
