
How does Hadoop get input data not stored on HDFS?

I'm trying to wrap my brain around Hadoop. I've read this excellent tutorial and perused the official Hadoop docs. However, in none of this literature can I find a simple explanation for something pretty rudimentary:

In all the contrived "Hello World!" (word count) introductory MR examples, the input data is stored directly in text files. However, it feels to me like this would seldom be the case out in the real world. I would imagine that in reality, the input data would live in large data stores such as a relational DB, Mongo, or Cassandra, or would only be available via a REST API, etc.

So I ask: in the real world, how does Hadoop get its input data? I do see that there are projects like Sqoop and Flume, and I'm wondering whether the whole point of these frameworks is simply to ETL input data onto HDFS for running MR jobs.

Actually, HDFS is needed in real-world applications for several reasons:

  • Very high bandwidth to support MapReduce workloads, and scalability.
  • Data reliability and fault tolerance, thanks to replication and its distributed nature; this is required for critical data systems.
  • Flexibility - you don't have to pre-process data before storing it in HDFS (see the sketch below).
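
To illustrate the last point: raw files can be copied into HDFS exactly as they are and parsed later inside the MR job itself (schema-on-read). A minimal shell sketch, where the local file and the target directory are hypothetical placeholders:

    # Copy a raw, unprocessed log file into HDFS as-is; any parsing
    # happens later, inside the MapReduce job (schema-on-read).
    hdfs dfs -mkdir -p /data/raw/logs       # hypothetical target directory
    hdfs dfs -put app.log /data/raw/logs/   # app.log is a hypothetical local file
    hdfs dfs -ls /data/raw/logs             # confirm the file is now in HDFS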

Hadoop is designed around a write-once, read-many concept. Kafka, Flume and Sqoop, which are generally used for ingestion, are themselves very fault tolerant and provide high bandwidth for data ingestion into HDFS. Sometimes you need to ingest gigabytes of data per minute from thousands of sources; for that you need these ingestion tools as well as a fault-tolerant storage system - HDFS.
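
For example, Sqoop can pull a table from a relational database into HDFS with a single command. A minimal sketch, assuming a hypothetical MySQL database; the JDBC URL, credentials, table name and target directory below are hypothetical placeholders:

    # Import one table from a (hypothetical) MySQL database into HDFS.
    # -P prompts for the password; --num-mappers controls how many
    # parallel map tasks perform the copy.
    sqoop import \
      --connect jdbc:mysql://dbhost:3306/sales \
      --username etl_user -P \
      --table orders \
      --target-dir /data/sales/orders \
      --num-mappers 4

Flume plays the analogous role for streaming sources (e.g. application logs), continuously writing events into HDFS through its HDFS sink, and Kafka is often placed in front of it as a durable, high-throughput buffer.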
