
Handle (drop and log) bad data published by a Kafka producer, such that the Spark (Java) consumer doesn't store it in HDFS

Currently, I am using a Spark consumer written in Java to read records (JSON) published by a Kafka producer and store them in HDFS. Say a record has the attributes (id, name, company, published date). At the moment, if one of the attributes is missing, the program throws a RuntimeException with a log message stating which attribute is missing, but the problem is that this exception stops the entire Spark job. Instead of stopping the whole job, I would like the program to drop and log those bad records rather than throwing an exception.

The answer is going to be opinion-based. Here is what I would do:

Don't log rejections in a log file, because it could grow large and you may need to reprocess those records later. Instead, create another dataset for rejected records that includes the reason for rejection. Your process would then produce two datasets: good records and rejected ones (see the sketch below).
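
A minimal sketch of that split, assuming the Kafka messages have already been parsed into a Dataset<Row> with the four attributes from the question; the class name and HDFS paths are placeholders:

```java
// Sketch only: "records" is assumed to be a Dataset<Row> parsed from the Kafka
// JSON values; the output paths are illustrative.
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.not;

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class SplitGoodAndRejected {

    public static void writeBoth(Dataset<Row> records) {
        // All required attributes must be present for a record to count as good.
        Column isValid = col("id").isNotNull()
                .and(col("name").isNotNull())
                .and(col("company").isNotNull())
                .and(col("published_date").isNotNull());

        Dataset<Row> good = records.filter(isValid);
        Dataset<Row> rejected = records.filter(not(isValid));

        // Two outputs: good records go to the main dataset, rejected ones are
        // kept separately so they can be inspected and reprocessed later.
        good.write().mode("append").parquet("hdfs:///data/records/good");
        rejected.write().mode("append").parquet("hdfs:///data/records/rejected");
    }
}
```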

Exceptions shouldn't be used for control flow, even though it is possible. I would use a predicate/filter/IF-condition that checks the data and rejects the records that don't meet it, recording why they were rejected (as shown below).
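
One way to express that predicate, sketched under the same assumptions (column and class names are illustrative), is to compute the rejection reason as a column instead of throwing, so the rejected dataset above can carry it:

```java
// Sketch only: the rejection reason is computed with when/otherwise; a null
// reason means the record passed all checks.
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lit;
import static org.apache.spark.sql.functions.when;

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class RejectionReason {

    public static Dataset<Row> withReason(Dataset<Row> records) {
        Column reason = when(col("id").isNull(), "missing id")
                .when(col("name").isNull(), "missing name")
                .when(col("company").isNull(), "missing company")
                .when(col("published_date").isNull(), "missing published_date")
                .otherwise(lit(null));

        return records.withColumn("rejection_reason", reason);
    }
}
```

Good records are then `filter(col("rejection_reason").isNull())` and rejected ones the opposite, so both outputs come from the same validated dataset.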

If you do use exceptions, bound them around the processing of an individual record, not the entire job (see the sketch below). Even so, it is better to avoid this approach.
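
For completeness, a minimal sketch of bounding the exception to a single record, assuming the raw Kafka values arrive as JSON strings in a JavaRDD<String>; a record that fails to parse is logged and dropped instead of failing the job:

```java
// Sketch only: the exception is caught per record inside flatMap, so a bad
// record is logged and dropped rather than stopping the whole Spark job.
import java.util.Collections;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class PerRecordTryCatch {

    // One mapper per executor JVM; static fields are not serialized with the lambda.
    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static JavaRDD<JsonNode> parseLeniently(JavaRDD<String> rawJson) {
        FlatMapFunction<String, JsonNode> parseOne = value -> {
            try {
                return Collections.singletonList(MAPPER.readTree(value)).iterator();
            } catch (Exception e) {
                // The exception is bounded to this single record: log it and emit nothing.
                System.err.println("Dropping bad record: " + value);
                return Collections.<JsonNode>emptyIterator();
            }
        };
        return rawJson.flatMap(parseOne);
    }
}
```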
