What is the difference between a data lake with HDFS or S3 in AWS?

I need to build a data lake on AWS, but I don't know exactly how S3 differs from HDFS. I found some answers on the Internet, but I still don't understand the real difference.

I would also like to know whether anyone has a data lake architecture that uses HDFS and S3 on AWS.

HDFS is only accessible to the Hadoop cluster in which it exists. If the cluster is shut down or terminated, the data in HDFS is lost.

Data in Amazon S3:

  • Remains available at all times (it cannot be 'turned off')
  • Is accessible to multiple clusters
  • Is accessible to other AWS services, such as Amazon Athena (which is 'Presto as a service', so you might not even need a Hadoop cluster)
  • Can be kept in multiple storage classes, so less-frequently accessed data can be stored at a lower cost
  • Has no storage limit (while HDFS is limited to the storage available in the Hadoop cluster)
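
To make the storage-class point concrete, here is a minimal sketch of an S3 lifecycle rule that moves ageing objects to cheaper classes. The helper names (`build_lifecycle_config`, `apply_lifecycle_config`), the `raw/` prefix, the bucket name, and the 30/90-day thresholds are illustrative assumptions, not part of the original answer. The only AWS call used is boto3's `put_bucket_lifecycle_configuration`, and it is invoked only if you call the second helper with valid credentials.

```python
from typing import Any, Dict


def build_lifecycle_config(prefix: str,
                           ia_after_days: int = 30,
                           glacier_after_days: int = 90) -> Dict[str, Any]:
    """Build a lifecycle configuration (plain dict) that moves objects
    under `prefix` to cheaper storage classes as they age.

    The thresholds are hypothetical defaults; tune them to your access pattern.
    """
    if not 0 < ia_after_days < glacier_after_days:
        raise ValueError("expected 0 < ia_after_days < glacier_after_days")
    # Derive a readable rule ID from the prefix (assumed naming scheme).
    rule_id = "tier-down-" + (prefix.strip("/").replace("/", "-") or "all")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    # Infrequent Access first, then Glacier for older data.
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


def apply_lifecycle_config(bucket: str, config: Dict[str, Any]) -> None:
    """Send the configuration to S3. Requires boto3 and AWS credentials."""
    import boto3  # third-party; imported lazily so the builder works without it

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )


if __name__ == "__main__":
    # Only builds and prints the rule; nothing is sent to AWS here.
    print(build_lifecycle_config("raw/"))
```

To apply it, you would call something like `apply_lifecycle_config("my-data-lake-bucket", build_lifecycle_config("raw/"))`, where the bucket name is a placeholder. HDFS has no direct equivalent of this per-object tiering managed by the storage service itself, which is one reason S3 is often chosen as the long-lived storage layer of a data lake.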
