
5 node m4.large Instances vs m4.2xlarge RDS

I'm asking this question to get some opinions on Amazon services.

I am currently running RDS on an m4.2xlarge instance, but I'm having performance issues on large databases, so I have decided to look into Big Data. I am thinking of starting to use Hadoop with 5 Amazon m4.large or m4.xlarge instances.

Does anyone have any similar experience or advice in the subject?

Hadoop and RDS are very different technologies and are not interchangeable.

RDS is designed for fast transaction processing (OLTP), while Hadoop is tailored more towards batch processing (OLAP). With the advent of Spark, that line is blurring. There are SQL query engines for Hadoop, but they will not replace a SQL database where it is strongest: complex queries, table joins, etc.

There is a point where data is just too big for traditional SQL servers. I would look into Redshift at that point. You will have to rethink how your data is stored, your query format, etc.

You have not provided details on the performance areas that are giving you problems. For read issues, look at scaling wider (read-replicas). For write issues you will need to scale bigger (larger / faster machine, faster storage, more memory, etc). In some cases just optimizing your queries can have significant effects.
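If you go the read-scaling route, a replica can be created through the RDS API. A minimal boto3 sketch, assuming placeholder instance identifiers (not real resources):

```python
import boto3

# Minimal sketch: create an RDS read replica to scale reads horizontally.
# Instance identifiers and region below are placeholders.
rds = boto3.client("rds", region_name="us-east-1")

response = rds.create_db_instance_read_replica(
    DBInstanceIdentifier="mydb-read-replica-1",    # new replica name (placeholder)
    SourceDBInstanceIdentifier="mydb-primary",     # existing m4.2xlarge instance (placeholder)
    DBInstanceClass="db.m4.large",                 # replicas may use a smaller class than the source
)
print(response["DBInstance"]["DBInstanceStatus"])
```

The application then sends read-only queries to the replica's endpoint while writes keep going to the primary.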

In summary, your question needs a lot more detail before an informative answer can be given.

John Hanley is right: RDS and Hadoop are very different beasts. The question is, what kind of data are you working with?

If the data and your use cases are inherently relational in nature (foreign keys, indices, uniqueness constraints, ACID transactions, need for efficient joins and arbitrary queries) then you may be best served with a 'webscale' SQL database -- in this case I would recommend taking a look at Amazon Aurora. It is a drop-in replacement for either MySQL or PostgreSQL with vastly better performance and scalability.
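To illustrate the "drop-in" part: an application already talking to RDS MySQL only needs its endpoint changed; the client code stays the same. A sketch using PyMySQL, where the endpoint, credentials, and table name are placeholders:

```python
import pymysql

# Same client library and queries as before; only the host changes to the
# Aurora cluster endpoint (all values below are placeholders).
conn = pymysql.connect(
    host="my-aurora-cluster.cluster-xxxxxxxx.us-east-1.rds.amazonaws.com",
    user="admin",
    password="...",
    database="mydb",
)
with conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM orders")   # hypothetical table
    print(cur.fetchone())
conn.close()
```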

If your data is sort of relational but your use case is more towards Business Intelligence (star/snowflake schemas, columnar aggregations, arbitrary drilldowns) and you are less dependent on write latency, you are probably better off with a data warehouse like Redshift.
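Redshift speaks the PostgreSQL wire protocol, so a typical star-schema rollup can be run with a standard Postgres client. A sketch with psycopg2; the cluster endpoint, credentials, and the fact/dimension tables (fact_sales, dim_date) are made up for illustration:

```python
import psycopg2

# Sketch of a columnar aggregation against a hypothetical star schema.
conn = psycopg2.connect(
    host="my-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439,
    dbname="analytics",
    user="awsuser",
    password="...",
)
cur = conn.cursor()
cur.execute("""
    SELECT d.year, d.month, SUM(f.revenue) AS revenue
    FROM fact_sales f
    JOIN dim_date d ON f.date_key = d.date_key
    GROUP BY d.year, d.month
    ORDER BY d.year, d.month
""")
for row in cur.fetchall():
    print(row)
conn.close()
```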

If your data is more lookup-table-like, with the bulk of your queries being point queries into a large namespace (think GUIDs, cookie IDs, device IDs like IDFAs) then you're likely going to want a Key-Value store - DynamoDB would be the obvious choice on AWS, though for some workloads (and datasets smaller than, say, 100GB) you could also consider Redis on ElastiCache.
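A point lookup in DynamoDB is a single GetItem keyed on the ID. A boto3 sketch, where the table name and key attribute are hypothetical:

```python
import boto3

# Sketch of a key-value point query: fetch one item by device ID.
# Table name and key attribute are hypothetical.
dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("device_profiles")

resp = table.get_item(Key={"device_id": "IDFA-0000-1111-2222"})
item = resp.get("Item")   # None if the key does not exist
print(item)
```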

If your data is more event-like -- say, you are storing banner impressions or IoT messages -- then you probably want a stack that allows you to ingest new data in realtime; Druid or HBase+Phoenix may be the answer here, if not a dedicated time series database.
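As one illustration of that ingest pattern, writing an event row into HBase (which Phoenix can then expose to SQL) might look like the happybase sketch below; the host, table name, column family, and row-key layout are assumptions for the example:

```python
import time
import happybase

# Sketch: ingest one IoT event into an HBase table keyed by device ID + timestamp.
# Host, table name, column family, and row-key scheme are assumptions.
connection = happybase.Connection("hbase-master.internal")   # placeholder host
table = connection.table("events")

row_key = f"device-42|{int(time.time() * 1000)}".encode()
table.put(row_key, {
    b"d:temperature": b"21.5",
    b"d:humidity": b"0.48",
})
connection.close()
```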

And finally, if your data is large and your common use case involves complex and arbitrary (non-precalculated) queries over high terabytes or petabytes of data, then Hadoop is going to be a great option, as it is a lot cheaper to store your data on S3 and spin up EMR clusters as needed than it would be to run the hardware needed to keep the data in a database or data warehouse stack. If this is the route you go, you can often get a very significant performance boost by storing your data in a columnar format (like Parquet) on disk and querying it with something like Spark SQL or Presto (Athena on AWS). However, once you switch to this kind of 'pure' big data stack you are in OLAP territory, meaning you are probably looking at query times in the minutes to hours rather than milliseconds to seconds, so that's something to be aware of.
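A minimal PySpark sketch of that pattern, reading Parquet directly from S3 and querying it with Spark SQL; the bucket path and column names are placeholders:

```python
from pyspark.sql import SparkSession

# Sketch: query columnar Parquet data sitting in S3 with Spark SQL on EMR.
# Bucket path and column names are placeholders.
spark = SparkSession.builder.appName("adhoc-olap").getOrCreate()

events = spark.read.parquet("s3://my-bucket/events/")   # placeholder path
events.createOrReplaceTempView("events")

result = spark.sql("""
    SELECT country, COUNT(*) AS impressions
    FROM events
    WHERE event_date >= '2020-01-01'
    GROUP BY country
    ORDER BY impressions DESC
""")
result.show()
```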
