
Synchronize data to HBase/HDFS and use it as input to MapReduce job

Original 2012-02-17 13:37:23 2 1 hadoop/ mapreduce/ hbase/ hdfs

I would like to synchronize data to a Hadoop filesystem. This data is intended to be used as input for a scheduled MapReduce job.

This example might explain more:

Let's say I have an input stream of documents, each containing a bunch of words. These words are needed as input for a MapReduce WordCount job. So, for each document, all words should be parsed out and uploaded to the filesystem. However, if the same document arrives from the input stream again, I only want the changes to be uploaded to (or deleted from) the filesystem.
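To make the "only the changes" requirement concrete, here is a minimal sketch in Python of computing which words were added or removed between two versions of the same document. The function names and the simple regex tokenizer are assumptions made for illustration; how the previous version is remembered, and how the delta is then applied to HDFS or HBase, is left open.

```python
import re
from collections import Counter
from typing import Tuple


def word_counts(text: str) -> Counter:
    """Lower-case the text and count its words (a deliberately simple tokenizer)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def word_delta(old_text: str, new_text: str) -> Tuple[Counter, Counter]:
    """Return (added, removed) word counts between two versions of a document."""
    old, new = word_counts(old_text), word_counts(new_text)
    added = new - old      # Counter subtraction keeps only positive counts
    removed = old - new
    return added, removed


if __name__ == "__main__":
    added, removed = word_delta("the quick brown fox", "the quick red fox jumps")
    print("upload:", dict(added))    # words that are new in this version
    print("delete:", dict(removed))  # words that no longer appear
```

Words that occur equally often in both versions drop out of both results, so only the difference would have to be sent to the store.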

How should the data be stored; should I use HDFS or HBase? The amount of data is not very large, maybe a couple of GB.

Is it possible to start scheduled MapReduce jobs with input from HDFS and/or HBase?

1 answer

I would first pick the best tool for the job, or do some research to make a reasonable choice. You're asking the question, which is the most important step. Given the amount of data you're planning to process, Hadoop is probably just one option. If this is the first step towards bigger and better things, then that would narrow the field.

I would then start off with the simplest approach that I expect to work, which typically means using the tools I already know. Write the code flexibly, so that it is easy to replace your original choices with better ones as you learn more or run into roadblocks. Given what you've stated in your question, I'd start off with HDFS, using the Hadoop command-line tools to push the data to an HDFS folder (hadoop fs -put ...). Then I'd write an MR job or jobs to do the processing, running them manually. Once that was working, I'd probably use cron to handle scheduling of the jobs.
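As a rough sketch of that workflow, the Python below builds the `hadoop fs -put` call and formats a crontab entry for a nightly job. All paths, the jar name, and the `WordCount` class name are hypothetical placeholders, and the default dry-run mode only prints the command, so nothing here needs a Hadoop installation to try out.

```python
import shlex
import subprocess
from typing import List


def build_put_command(local_path: str, hdfs_dir: str) -> List[str]:
    """Build the argument list for `hadoop fs -put <local_path> <hdfs_dir>`."""
    return ["hadoop", "fs", "-put", local_path, hdfs_dir]


def push_to_hdfs(local_path: str, hdfs_dir: str, dry_run: bool = True) -> List[str]:
    """Print the upload command (dry run) or execute it; return the command either way."""
    cmd = build_put_command(local_path, hdfs_dir)
    if dry_run:
        print("would run:", " ".join(shlex.quote(arg) for arg in cmd))
    else:
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return cmd


def cron_line(schedule: str, command: List[str]) -> str:
    """Format a crontab entry: the five schedule fields followed by the command."""
    quoted = " ".join(shlex.quote(arg) for arg in command)
    return schedule + " " + quoted


if __name__ == "__main__":
    # Hypothetical local file and HDFS input folder.
    push_to_hdfs("/tmp/words/doc-001.txt", "/user/example/wordcount/input")

    # Hypothetical jar, main class, and input/output folders for the MR job.
    job = ["hadoop", "jar", "/opt/jobs/wordcount.jar", "WordCount",
           "/user/example/wordcount/input", "/user/example/wordcount/output"]
    print(cron_line("0 2 * * *", job))  # every day at 02:00
```

One thing to watch with scheduled runs: a standard MapReduce job fails if its output directory already exists, so each run would need a fresh output folder or a cleanup step first.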

That's a place to start. As you build the process, if you reach a point where HBase seems like a natural fit for what you want to store, then switch over to that. Solve one problem at a time, and that will give you clarity on which tools are the right choice each step of the way. For example, you might get to the scheduling step and know by that time that cron won't do what you need - perhaps your organization has requirements for job scheduling that cron won't fulfil. So, you pick a different tool.
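One way to keep the storage choice easy to swap, sketched here in Python under my own assumptions: hide it behind a small interface. `WordSink`, `InMemorySink`, and `ingest` are hypothetical names, and the in-memory class is only a stand-in; an HDFS-backed or HBase-backed sink would implement the same `store` method.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class WordSink(ABC):
    """Anything that can store the words of one document."""

    @abstractmethod
    def store(self, doc_id: str, words: List[str]) -> None:
        """Store (or replace) the words for the given document."""


class InMemorySink(WordSink):
    """Stand-in backend for local testing; keeps everything in a dict."""

    def __init__(self) -> None:
        self.docs: Dict[str, List[str]] = {}

    def store(self, doc_id: str, words: List[str]) -> None:
        self.docs[doc_id] = list(words)  # a re-sent document replaces the old entry


def ingest(doc_id: str, text: str, sink: WordSink) -> None:
    """Split a document into words and hand them to whichever sink is configured."""
    sink.store(doc_id, text.split())


if __name__ == "__main__":
    sink = InMemorySink()
    ingest("doc-1", "hello hadoop hello", sink)
    print(sink.docs)
```

The ingestion code only ever talks to `WordSink`, so moving from HDFS to HBase later would mean writing one new class rather than touching the pipeline.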

How to use a HBase secondary index table as and input in a MapReduce Job?

What would happen to a MapReduce job if input data source keep increasing in HDFS?

using HBase instead of HDFS in MapReduce

hadoop map reduce job with HDFS input and HBASE output

HBase table as MapReduce input?

Use some datatype as input for a MapReduce job.

Is it possible to Stream data as input into MapReduce job

HBase to Use HDFS HA

Hadoop MapReduce: Is it possible to only use a fraction of the input data as the input to a MR job?

HBase and HDFS data delimiters?


The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address. For any questions, please contact: yoyou2525@163.com.


solution1 0 2012-02-17 14:36:21
