
Data streaming API: High availability

In my architecture on AWS, I have a service running on an EC2 instance that calls the Twitter streaming API for data ingestion, i.e., ingestion of real-time tweets. I call this service TwitterClient.

The Twitter API uses a kind of long polling over HTTP to deliver streaming data. According to the documentation, a single connection is opened between your app (in my case, TwitterClient) and the API, with new tweets being sent through that connection.

TwitterClient then passes the real-time tweets to the backend (using Kinesis Data Streams) for processing.

The problem I am facing is this: running multiple EC2 instances in parallel will result in duplicate tweets being ingested and each tweet will be processed several times. However, running only one EC2 instance makes it a single point of failure.

I cannot afford downtime as I can't miss a single tweet.

What should I do to ensure high availability?

Edit: Added a brief description of how Twitter API delivers streaming data

The simplest way to implement this is to run multiple EC2 instances in parallel, in different regions. You can certainly get more complex, and use heartbeats between the instances, but this is probably over-engineering.

multiple EC2 instances in parallel will result in duplicate tweets being ingested and each tweet will be processed several times

Tweets have a unique message ID that can be used to deduplicate.
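As a rough illustration of ID-based deduplication, here is a minimal sketch in Python. It assumes each tweet is a dict with an `id` field and that remembering only the most recent IDs is enough; the names `TweetDeduplicator`, `deduplicate`, and `max_ids` are hypothetical. With several ingesting instances, this check would have to run downstream (for example in the consumer of the stream) or against a shared store rather than inside each instance.

```python
from collections import OrderedDict


class TweetDeduplicator:
    """Remembers recently seen tweet IDs so repeated deliveries can be dropped."""

    def __init__(self, max_ids=100_000):
        self.max_ids = max_ids      # hypothetical bound on remembered IDs
        self._seen = OrderedDict()  # insertion-ordered, oldest ID first

    def is_new(self, tweet_id):
        """Return True the first time an ID is seen, False for repeats."""
        if tweet_id in self._seen:
            return False
        self._seen[tweet_id] = None
        if len(self._seen) > self.max_ids:
            self._seen.popitem(last=False)  # evict the oldest remembered ID
        return True


def deduplicate(tweets, dedup):
    """Yield only tweets whose 'id' (assumed field name) has not been seen before."""
    for tweet in tweets:
        if dedup.is_new(tweet["id"]):
            yield tweet
```

The bounded memory is a trade-off: a duplicate that arrives after its ID has been evicted would be let through, so `max_ids` should comfortably cover the expected delay between instances.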

I can't miss a single tweet

This is your real problem. Twitter limits you to a certain number of requests per 15-minute period. Assuming that you have reasonable filter rules (i.e., you don't try to read the entire tweetstream, or even the tweetstream for a broad topic), then this should be sufficient to capture all tweets.

However, it may not be sufficient if you're running multiple instances. You could try using two API keys (assuming that Twitter allows that) or adjust your polling frequency to something that allows multiple instances to run concurrently.

Beware, however: as far as I know there are no guarantees. If you need guaranteed access to every relevant tweet, you would need to talk to Twitter (and be prepared to pay them for the privilege).

You can set up two EC2 instances behind a load balancer, keeping only one instance active at a time and the other passive (as a backup). The second becomes active when the first goes down.
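One possible way for the two instances to agree on which of them is active is a shared lease with an expiry time. The sketch below is only an illustration under that assumption, not something this answer prescribes: `LeaseStore` is a hypothetical in-memory stand-in for a shared store that supports atomic conditional writes, and `is_active`, `instance_id`, and `ttl` are made-up names.

```python
class LeaseStore:
    """In-memory stand-in (hypothetical) for a shared store with atomic conditional writes."""

    def __init__(self):
        self.owner = None
        self.expires_at = 0.0

    def try_acquire(self, instance_id, now, ttl):
        """Grant or renew the lease if it is free, already ours, or expired."""
        if self.owner in (None, instance_id) or now >= self.expires_at:
            self.owner = instance_id
            self.expires_at = now + ttl
            return True
        return False


def is_active(store, instance_id, now, ttl=10.0):
    """An instance should hold the streaming connection only while it holds the lease."""
    return store.try_acquire(instance_id, now, ttl)
```

Each instance would call `is_active` periodically with the current time, for example `is_active(store, "instance-a", time.time())`, and open or close its connection accordingly. Tweets that arrive between the active instance failing and the lease expiring would still be missed unless they are backfilled some other way.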


 