
Map-only jobs in Spark (vs Hadoop Streaming)

I have a function process_line that maps lines from the input format to the output format.

Some rows are corrupted and need to be ignored.

I am successfully running this code as a Python Hadoop Streaming job:

import sys

for input_line in sys.stdin:
    try:
        output_line = process_line(input_line.strip())
        print(output_line)
    except Exception:
        sys.stderr.write('Error with line: {l}\n'.format(l=input_line))
        continue

How can I run the equivalent code in PySpark? This is what I tried:

lines = sc.textFile(input_dir, 1)
output = lines.map(process_line)
output.saveAsTextFile(output_dir)

How can I keep track of corrupted lines and get statistics on their count?

You're reading the text file into only one partition, which can make your job run slowly because you essentially give up parallelism.

Try to do this:

lines = sc.textFile(input_dir)
output = lines.map(process_line)
output.saveAsTextFile(output_dir)
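
If you want to confirm how many partitions Spark actually created for the input (assuming the lines RDD from above), a quick sanity check is:

# Number of partitions Spark created for the input file(s);
# with the default textFile call this typically follows the input splits / block layout.
print(lines.getNumPartitions())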

As for the corrupted lines, you can use a try-except inside your process_line function and, for example, write the problematic line to a log file, or apply whatever other handling you need; see the sketch below for counting them as well.
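
Here is a minimal sketch of that idea combined with a Spark accumulator to get a count of corrupted lines (assuming sc, process_line, input_dir and output_dir are already defined as in the question; note that accumulator updates made inside transformations can be over-counted if tasks are retried, so treat the number as approximate):

# Accumulator to count corrupted lines across all executors
corrupted_count = sc.accumulator(0)

def safe_process_line(line):
    # Return None for corrupted lines and count them instead of failing the job
    try:
        return process_line(line.strip())
    except Exception:
        corrupted_count.add(1)
        return None

lines = sc.textFile(input_dir)
output = lines.map(safe_process_line).filter(lambda x: x is not None)
output.saveAsTextFile(output_dir)

# The accumulator value is only populated after an action has run
print("Corrupted lines:", corrupted_count.value)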


 