
Map-only jobs in Spark (vs Hadoop Streaming)

I have a function process_line that maps lines from the input format to the output format.

Some rows are corrupted and need to be ignored.

I am successfully running this code as a Python Hadoop Streaming job:

import sys

for input_line in sys.stdin:
    try:
        output_line = process_line(input_line.strip())
        print(output_line)
    except Exception:
        sys.stderr.write('Error with line: {l}\n'.format(l=input_line))
        continue

How can I run the equivalent code in PySpark? This is what I tried:

lines = sc.textFile(input_dir, 1)
output = lines.map(process_line)
output.saveAsTextFile(output_dir)

How can I keep track of corrupted lines and get statistics on their count?

You're reading the text file into only one partition, which can make your job run slowly because you essentially give up parallelism.

Try to do this:

lines = sc.textFile(input_dir)
output = lines.map(process_line)
output.saveAsTextFile(output_dir)
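
If you want to confirm how many partitions Spark actually created for the input (assuming the lines RDD from above), a quick sanity check is:

# Number of partitions Spark created for the input file(s);
# with the default textFile call this typically follows the input splits / block layout.
print(lines.getNumPartitions())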

As for the corrupted lines, you can use a try-except inside your process_line function and, for example, write the problematic line to a log file, or apply whatever other handling you need; see the sketch below for counting them as well.
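
Here is a minimal sketch of that idea combined with a Spark accumulator to get a count of corrupted lines (assuming sc, process_line, input_dir and output_dir are already defined as in the question; note that accumulator updates made inside transformations can be over-counted if tasks are retried, so treat the number as approximate):

# Accumulator to count corrupted lines across all executors
corrupted_count = sc.accumulator(0)

def safe_process_line(line):
    # Return None for corrupted lines and count them instead of failing the job
    try:
        return process_line(line.strip())
    except Exception:
        corrupted_count.add(1)
        return None

lines = sc.textFile(input_dir)
output = lines.map(safe_process_line).filter(lambda x: x is not None)
output.saveAsTextFile(output_dir)

# The accumulator value is only populated after an action has run
print("Corrupted lines:", corrupted_count.value)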


 