
Run HDFS cluster on AWS without EMR

I want to run an HDFS cluster on AWS where I can store the data to be processed by my custom application running on EC2 instances. AWS EMR is the only way I could find to create an HDFS cluster on AWS. There are also tutorials on the web for creating an HDFS cluster directly on EC2 instances, but if I do that, I run the risk of losing the data when I shut the instances down.

What I need is:
1. An HDFS cluster that can be shut down when not in use.
2. When shut down, data should remain persisted.

There is a solution that says I can keep my data in an S3 bucket and load it every time I start the EMR cluster. However, this is repetitive and a huge overhead, especially if the data is large.
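For reference, that reload is usually done with s3-dist-cp (which ships with EMR) or the AWS CLI. This is only a sketch of the pattern being described; the bucket and path names are hypothetical placeholders:

```shell
# Copy input data from S3 into the cluster's transient HDFS at startup
# (bucket and paths are placeholders, not real resources).
s3-dist-cp --src s3://my-bucket/input-data/ --dest hdfs:///input-data/

# ... run jobs against hdfs:///input-data/ ...

# Copy results back to S3 before terminating the cluster.
s3-dist-cp --src hdfs:///output-data/ --dest s3://my-bucket/output-data/
```

This is exactly the overhead the question is complaining about: the copy has to run on every cluster start and stop.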

In GCP, I used a Dataproc cluster, which satisfied both of the above criteria. Shutting down the cluster at least saved the cost of the VMs, and I only paid for storage when the HDFS cluster was not in use. I am wondering whether there is a similar option on AWS.

You can leverage EFS (Elastic File System), a managed network file system whose data persists independently of your instances, so it will still be available whenever you restart your EC2 instances.

You can also share the same EFS file system across multiple EC2 instances if required, so using EFS in place of HDFS is a good option for your use case.

More details are in the Amazon EFS documentation.
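For completeness, attaching an EFS file system to an instance is an ordinary NFSv4.1 mount. The file-system ID, region, and mount point below are placeholders, and the mount options are the ones AWS recommends for EFS:

```shell
# Mount an EFS file system over NFSv4.1 (ID, region, and mount point are hypothetical).
sudo mkdir -p /mnt/efs
sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 \
    fs-12345678.efs.us-east-1.amazonaws.com:/ /mnt/efs

# To persist the mount across reboots, an /etc/fstab entry would look like:
# fs-12345678.efs.us-east-1.amazonaws.com:/  /mnt/efs  nfs4  nfsvers=4.1,_netdev  0  0
```

The same file system can be mounted on many instances at once, which is what makes it usable as shared persistent storage.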

I think you may have an XY problem here. You almost certainly do not want to have a remote HDFS filesystem on EMR.

EMR provides two HDFS-compatible filesystems to Hadoop and Spark natively:

1) A transient filesystem, accessed via hdfs://. This is primarily for scratch/temporary data. It lasts as long as the cluster does, and is backed by EBS.

2) A persistent filesystem, accessed via s3://. This is referred to as EMRFS in the documentation. It is backed by S3.

So, for example, if in Spark you are used to doing something like

spark.read.parquet("hdfs://mydata/somepartition/").doWork().write.parquet("hdfs://mynewdata/somepartition/")

you now just do

spark.read.parquet("s3://mybucket/mydata/somepartition/").doWork().write.parquet("s3://mybucket/mynewdata/somepartition/")

and everything just works. s3:// is optimized by the EMR folks for speed, since they know your EMR cluster lives in the same region as the S3 data.
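The mechanical change is nothing more than swapping the URI scheme and adding a bucket prefix. A tiny helper like this sketches the rewrite; the function and bucket name are hypothetical illustrations, assuming a flat hdfs:// path maps onto a single bucket:

```python
# Sketch of the hdfs:// -> s3:// path rewrite described above.
# The helper and bucket name are hypothetical, not part of any AWS API.

def to_emrfs_path(hdfs_path: str, bucket: str) -> str:
    """Map an hdfs:// URI onto the equivalent s3:// URI under one bucket."""
    prefix = "hdfs://"
    if not hdfs_path.startswith(prefix):
        raise ValueError(f"not an hdfs:// path: {hdfs_path}")
    return f"s3://{bucket}/{hdfs_path[len(prefix):]}"

print(to_emrfs_path("hdfs://mydata/somepartition/", "mybucket"))
# -> s3://mybucket/mydata/somepartition/
```

The rewritten path goes straight into spark.read.parquet or write.parquet; no other code changes are needed on EMR.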

EFS, per Shubham Jain's answer, will probably cause problems with EMR, since you would effectively be running a second HDFS backend alongside the transient one EMR provides. I suppose you could, but it would be a little strange: your EMR cluster already runs HDFS daemons (the NameNode on the master node, DataNodes on the core nodes), and you would need a separate set of daemons for the EFS-backed HDFS (which, I guess, would have to run as EMR task nodes?). EFS would be slower than the EBS-backed HDFS for transient data and more expensive than S3 for the permanent data.

If you don't want to use EMRFS for some reason (I have no idea why), you would probably be best off rolling your own cluster and not using EMR, because at that point you're looking to customize how HDFS is installed, and the point of EMR is to do that for you.
