
Run HDFS cluster on AWS without EMR

I want to run an HDFS cluster on AWS where I can store the data to be processed by my custom application running on EC2 instances. AWS EMR is the only way I could find to create an HDFS cluster on AWS. There are also tutorials on the web for creating an HDFS cluster directly on EC2 instances, but if I do that, I run the risk of losing the data when I shut the instances down.

What I need is:
1. An HDFS cluster that can be shut down when not in use.
2. When shut down, data should remain persisted.

There is a solution that says I can keep my data in an S3 bucket and load it every time I start the EMR cluster. However, this is repetitive and a huge overhead, especially if the data is large.
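For reference, that reload is usually done with s3-dist-cp (which ships with EMR) or the AWS CLI. This is only a sketch of the pattern being described; the bucket and path names are hypothetical placeholders:

```shell
# Copy input data from S3 into the cluster's transient HDFS at startup
# (bucket and paths are placeholders, not real resources).
s3-dist-cp --src s3://my-bucket/input-data/ --dest hdfs:///input-data/

# ... run jobs against hdfs:///input-data/ ...

# Copy results back to S3 before terminating the cluster.
s3-dist-cp --src hdfs:///output-data/ --dest s3://my-bucket/output-data/
```

This is exactly the overhead the question is complaining about: the copy has to run on every cluster start and stop.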

In GCP, I used a Dataproc cluster, which satisfied both of the above criteria. Shutting down the cluster at least saved the cost of the VMs, and I only paid for storage when the HDFS cluster was not in use. I am wondering whether there is a similar option on AWS.

You can leverage EFS (Elastic File System), a managed network file system whose data persists independently of your instances, so it will still be available whenever you restart your EC2 instances.

You can also share the same EFS file system across multiple EC2 instances if required, so using EFS in place of HDFS is a good option for your use case.

More details are in the Amazon EFS documentation.
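For completeness, attaching an EFS file system to an instance is an ordinary NFSv4.1 mount. The file-system ID, region, and mount point below are placeholders, and the mount options are the ones AWS recommends for EFS:

```shell
# Mount an EFS file system over NFSv4.1 (ID, region, and mount point are hypothetical).
sudo mkdir -p /mnt/efs
sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 \
    fs-12345678.efs.us-east-1.amazonaws.com:/ /mnt/efs

# To persist the mount across reboots, an /etc/fstab entry would look like:
# fs-12345678.efs.us-east-1.amazonaws.com:/  /mnt/efs  nfs4  nfsvers=4.1,_netdev  0  0
```

The same file system can be mounted on many instances at once, which is what makes it usable as shared persistent storage.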

I think you may have an XY problem here. You almost certainly do not want to have a remote HDFS filesystem on EMR.

EMR provides two HDFS-compatible filesystems to Hadoop and Spark natively:

1) A transient filesystem, accessed via hdfs://. This is primarily for scratch/temporary data. It lasts as long as the cluster does, and is backed by EBS.

2) A persistent filesystem, accessed via s3://. This is referred to as EMRFS in the documentation. It is backed by S3.

So, for example, if in Spark you are used to doing something like

spark.read.parquet("hdfs://mydata/somepartition/").doWork().write.parquet("hdfs://mynewdata/somepartition/")

you now just do

spark.read.parquet("s3://mybucket/mydata/somepartition/").doWork().write.parquet("s3://mybucket/mynewdata/somepartition/")

and everything just works. s3:// is optimized by the EMR folks for speed, since they know your EMR cluster lives in the same region as the S3 data.
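The mechanical change is nothing more than swapping the URI scheme and adding a bucket prefix. A tiny helper like this sketches the rewrite; the function and bucket name are hypothetical illustrations, assuming a flat hdfs:// path maps onto a single bucket:

```python
# Sketch of the hdfs:// -> s3:// path rewrite described above.
# The helper and bucket name are hypothetical, not part of any AWS API.

def to_emrfs_path(hdfs_path: str, bucket: str) -> str:
    """Map an hdfs:// URI onto the equivalent s3:// URI under one bucket."""
    prefix = "hdfs://"
    if not hdfs_path.startswith(prefix):
        raise ValueError(f"not an hdfs:// path: {hdfs_path}")
    return f"s3://{bucket}/{hdfs_path[len(prefix):]}"

print(to_emrfs_path("hdfs://mydata/somepartition/", "mybucket"))
# -> s3://mybucket/mydata/somepartition/
```

The rewritten path goes straight into spark.read.parquet or write.parquet; no other code changes are needed on EMR.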

EFS, per Shubham Jain's answer, will probably cause problems with EMR, since you would effectively be running a second HDFS backend alongside the transient one EMR provides. I suppose you could, but it would be a little strange: your EMR cluster already runs HDFS daemons (the NameNode on the master node, DataNodes on the core nodes), and you would need a separate set of daemons for the EFS-backed HDFS (which, I guess, would have to run as EMR task nodes?). EFS would be slower than the EBS-backed HDFS for transient data and more expensive than S3 for the permanent data.

If you don't want to use EMRFS for some reason (I have no idea why), you would probably be best off rolling your own cluster and not using EMR, because at that point you're looking to customize how HDFS is installed, and the point of EMR is to do that for you.
