I have a function process_line that maps a row from the input format to the output format. Some rows are corrupted and need to be ignored.
I am successfully running this code as a python streaming job:
for input_line in sys.stdin:
    try:
        output_line = process_line(input_line.strip())
        print(output_line)
    except Exception:
        sys.stderr.write('Error with line: {l}\n'.format(l=input_line))
        continue
How can I run the equivalent code in pyspark? This is what I tried:

input = sc.textFile(input_dir, 1)
output = input.map(process_line)
output.saveAsTextFile(output_dir)
How can I keep track of corrupted lines and have statistics on their count ?
You're reading the text file into only one partition, which can make your job run slowly because you give up the parallelism. Drop the explicit partition count and let Spark choose:
input = sc.textFile(input_dir)
output = input.map(process_line)
output.saveAsTextFile(output_dir)
As for the corrupted lines, you can use a try-except inside your process_line function (or a wrapper around it), write the problematic line to a log file, or apply whatever other handling fits your case.