
How to omit empty part-0000x files from a Python streaming MapReduce job

I created a Python mapper that I run as a Hadoop streaming MapReduce job. It validates the input and writes a message to output if the input is invalid.

...
# input from STDIN, one JSON document per line
for line in sys.stdin:
    indata = json.loads(line)
    try:
        jsonschema.validate(indata, schema)
    except jsonschema.ValidationError as error:
        # validation against schema failed: emit the message as an output record
        print(error.message)
    # any other exception propagates and fails the task

My question: The mapper writes the message for invalid input as expected, but it also creates empty "part-0000x" files for valid input.

I would like to omit the empty output files. How can I achieve this?

To omit the empty output files, use the LazyOutputFormat class. It creates a part file only when at least one record is written to that particular file.
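
The lazy behaviour described above can be pictured with a small pure-Python analogy. This is only an illustration of the idea and not how Hadoop implements LazyOutputFormat; the LazyPartWriter class and the part file names are hypothetical.

```python
import os
import tempfile


class LazyPartWriter(object):
    """Hypothetical writer that opens its file only on the first record."""

    def __init__(self, path):
        self.path = path
        self._handle = None  # nothing exists on disk yet

    def write(self, record):
        # The file is created here, on the first write, not in __init__.
        if self._handle is None:
            self._handle = open(self.path, "w")
        self._handle.write(record + "\n")

    def close(self):
        # Closing a writer that never wrote anything leaves no file behind.
        if self._handle is not None:
            self._handle.close()
            self._handle = None


if __name__ == "__main__":
    out_dir = tempfile.mkdtemp()

    # A task that saw only valid input writes nothing.
    quiet = LazyPartWriter(os.path.join(out_dir, "part-00000"))
    quiet.close()

    # A task that saw invalid input writes one message.
    noisy = LazyPartWriter(os.path.join(out_dir, "part-00001"))
    noisy.write("validation failed")
    noisy.close()

    print(os.listdir(out_dir))
```

Only the writer that received a record leaves a file in the directory, which is the effect wanted for the mapper output.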

However, LazyOutputFormat is part of the Java API, so you need to find the corresponding option for a Python streaming job.
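
For a streaming job, one way to get this behaviour is the -lazyOutput option of the Hadoop Streaming command line, which creates output lazily. Below is a minimal sketch that assembles such a command in Python. The build_streaming_command helper, the jar location, and all paths are assumptions for illustration, so adjust them to your installation.

```python
import os


def build_streaming_command(streaming_jar, input_path, output_path, mapper_script):
    """Return the argument list for a Hadoop Streaming job with lazy output.

    Hypothetical helper: it only assembles arguments and runs nothing.
    """
    return [
        "hadoop", "jar", streaming_jar,
        # Generic options such as -files go before the streaming options.
        "-files", mapper_script,
        "-input", input_path,
        "-output", output_path,
        # The shipped script is referenced by its base name on the task nodes.
        "-mapper", os.path.basename(mapper_script),
        # Ask streaming to create each part file only on the first write.
        "-lazyOutput",
    ]


if __name__ == "__main__":
    # All paths below are placeholders, not values from the original post.
    command = build_streaming_command(
        "/path/to/hadoop-streaming.jar",
        "/data/input",
        "/data/output",
        "mapper.py",
    )
    print(" ".join(command))
```

The resulting list can be passed to subprocess.call, or the printed line can be run directly in a shell.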


 