MapReduce - sequence jobs?

I am using MapReduce (just map, really) to do a data-processing task in four phases. Each phase is one MapReduce job. I need them to run in sequence, that is, don't start phase 2 until phase 1 is done, etc. Does anyone have experience doing this that they can share?

Ideally we'd do this 4-job sequence overnight, so making it cron-able would be a fine thing as well.

thank you

As Daniel mentions, the appengine-pipeline library is meant to solve this problem. I go over chaining MapReduce jobs together in this blog post, under the section "Implementing your own Pipeline jobs".

For convenience, I'll paste the relevant section here:

Now that we know how to launch the predefined MapreducePipeline, let's take a look at implementing and running our own custom pipeline jobs. The pipeline library provides a low-level library for launching arbitrary distributed computing jobs within appengine, but, for now, we'll talk specifically about how we can use this to help us chain mapreduce jobs together. Let's extend our previous example to also output a reverse index of characters and IDs.

First, we define the parent pipeline job.

class ChainMapReducePipeline(mapreduce.base_handler.PipelineBase):
    def run(self):
        deduped_blob_key = (
            yield mapreduce.mapreduce_pipeline.MapreducePipeline(
                "test_combiner",
                "main.map",
                "main.reduce",
                "mapreduce.input_readers.RandomStringInputReader",
                "mapreduce.output_writers.BlobstoreOutputWriter",
                combiner_spec="main.combine",
                mapper_params={
                    "string_length": 1,
                    "count": 500,
                },
                reducer_params={
                    "mime_type": "text/plain",
                },
                shards=16))

        char_to_id_index_blob_key = (
            yield mapreduce.mapreduce_pipeline.MapreducePipeline(
                "test_chain",
                "main.map2",
                "main.reduce2",
                "mapreduce.input_readers.BlobstoreLineInputReader",
                "mapreduce.output_writers.BlobstoreOutputWriter",
                # Pass output from first job as input to second job
                mapper_params=(yield BlobKeys(deduped_blob_key)),
                reducer_params={
                    "mime_type": "text/plain",
                },
                shards=4))

This launches the same job as the first example, takes the output from that job, and feeds it into the second job, which reverses each entry. Notice that the result of the first pipeline yield is passed in to mapper_params of the second job. The pipeline library uses magic to detect that the second pipeline depends on the first one finishing and does not launch it until the deduped_blob_key has resolved.
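
The "magic" is the Pipeline API's notion of futures: yielding a child pipeline returns a placeholder value, and any later child that receives that placeholder as an argument is only started once the earlier child has finished and the value has resolved. Here is a minimal sketch of that pattern using the library's plain pipeline.Pipeline base class (the class names Add and AddTwice are just illustrative, not part of the example above):

import pipeline


class Add(pipeline.Pipeline):
    def run(self, a, b):
        # An immediate child pipeline: its return value resolves the future.
        return a + b


class AddTwice(pipeline.Pipeline):
    def run(self, a, b, c):
        # At this point 'first' is a future (a placeholder), not a number.
        first = yield Add(a, b)
        # Because 'first' is passed as an argument, the library only starts
        # this child after Add(a, b) has finished and the future resolved.
        yield Add(first, c)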

Next, I had to create the BlobKeys helper class. At first, I didn't think this was necessary, since I could just do:

mapper_params={"blob_keys": deduped_blob_key},

But this didn't work, for two reasons. The first is that “generator pipelines cannot directly access the outputs of the child Pipelines that it yields”. The code above would require the generator pipeline to create a temporary dict object with the output of the first job, which is not allowed. The second is that the string returned by BlobstoreOutputWriter is of the format “/blobstore/<key>”, but BlobstoreLineInputReader expects simply “<key>”. To solve these problems, I made a little helper BlobKeys class. You'll find yourself doing this for many jobs, and the pipeline library even includes a set of common wrappers, but they do not work within the MapreducePipeline framework, which I discuss at the bottom of this section.

class BlobKeys(third_party.mapreduce.base_handler.PipelineBase):
  """Returns a dictionary with the supplied keyword arguments."""

  def run(self, keys):
    # Remove the key from a string in this format:
    # /blobstore/<key>
    return {
        "blob_keys": [k.split("/")[-1] for k in keys]
    }

Here is the code for the map2 and reduce2 functions:

def map2(data):
    # BlobstoreLineInputReader.next() returns a tuple
    start_position, line = data
    # Split input based on previous reduce() output format
    elements = line.split(" - ")
    random_id = elements[0]
    char = elements[1]
    # Swap 'em
    yield (char, random_id)

def reduce2(key, values):
    # Create the reverse index entry
    yield "%s - %s\n" % (key, ",".join(values))

I'm unfamiliar with google-app-engine, but couldn't you put all of the job configurations in a single main program and then run them in sequence? Something like the following? I think this works in normal MapReduce programs, so if the google-app-engine code isn't too different it should work fine.

// Hadoop driver snippet: assumes it runs inside a Configured/Tool driver,
// where getConf() returns the cluster Configuration.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf1 = getConf();
Configuration conf2 = getConf();
Configuration conf3 = getConf();
Configuration conf4 = getConf();

// whatever configuration you do for the jobs

Job job1 = new Job(conf1, "name1");
Job job2 = new Job(conf2, "name2");
Job job3 = new Job(conf3, "name3");
Job job4 = new Job(conf4, "name4");

// setup for the jobs here (mapper, reducer, input/output paths, ...)

// waitForCompletion(true) blocks until the job finishes, so each phase
// starts only after the previous one completes. Check the boolean return
// value if you want to abort the chain when a phase fails.
job1.waitForCompletion(true);
job2.waitForCompletion(true);
job3.waitForCompletion(true);
job4.waitForCompletion(true);

You need the appengine-pipeline project, which is designed for exactly this purpose.
