
Handle (drop and log) bad data published by a Kafka producer, such that the Spark (Java) consumer doesn't store it in HDFS

Currently, I am using a Spark consumer written in Java to read records (JSON) published by a Kafka producer and store them in HDFS. Say a record has the attributes (id, name, company, published date). At the moment, if one of the attributes is missing, the program throws a RuntimeException with a log message stating which attribute is missing, but the problem is that this exception stops the entire Spark job. Instead of stopping the whole job, I would like the program to drop and log those bad records rather than throwing an exception.

The answer is going to be opinion-based. Here is what I would do:

Don't log rejections in a log file, because it could grow large and you may need to reprocess those records later. Instead, create another dataset for rejected records that includes the reason for rejection. Your process would then produce two datasets: good records and rejected ones (see the sketch below).
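
A minimal sketch of that split, assuming the Kafka messages have already been parsed into a Dataset<Row> with the four attributes from the question; the class name and HDFS paths are placeholders:

```java
// Sketch only: "records" is assumed to be a Dataset<Row> parsed from the Kafka
// JSON values; the output paths are illustrative.
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.not;

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class SplitGoodAndRejected {

    public static void writeBoth(Dataset<Row> records) {
        // All required attributes must be present for a record to count as good.
        Column isValid = col("id").isNotNull()
                .and(col("name").isNotNull())
                .and(col("company").isNotNull())
                .and(col("published_date").isNotNull());

        Dataset<Row> good = records.filter(isValid);
        Dataset<Row> rejected = records.filter(not(isValid));

        // Two outputs: good records go to the main dataset, rejected ones are
        // kept separately so they can be inspected and reprocessed later.
        good.write().mode("append").parquet("hdfs:///data/records/good");
        rejected.write().mode("append").parquet("hdfs:///data/records/rejected");
    }
}
```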

Exceptions shouldn't be used for control flow, even though it is possible. I would use a predicate/filter/IF-condition that checks the data and rejects the records that don't meet it, recording why they were rejected (as shown below).
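
One way to express that predicate, sketched under the same assumptions (column and class names are illustrative), is to compute the rejection reason as a column instead of throwing, so the rejected dataset above can carry it:

```java
// Sketch only: the rejection reason is computed with when/otherwise; a null
// reason means the record passed all checks.
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lit;
import static org.apache.spark.sql.functions.when;

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class RejectionReason {

    public static Dataset<Row> withReason(Dataset<Row> records) {
        Column reason = when(col("id").isNull(), "missing id")
                .when(col("name").isNull(), "missing name")
                .when(col("company").isNull(), "missing company")
                .when(col("published_date").isNull(), "missing published_date")
                .otherwise(lit(null));

        return records.withColumn("rejection_reason", reason);
    }
}
```

Good records are then `filter(col("rejection_reason").isNull())` and rejected ones the opposite, so both outputs come from the same validated dataset.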

If you do use exceptions, bound them around the processing of an individual record, not the entire job (see the sketch below). Even so, it is better to avoid this approach.
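
For completeness, a minimal sketch of bounding the exception to a single record, assuming the raw Kafka values arrive as JSON strings in a JavaRDD<String>; a record that fails to parse is logged and dropped instead of failing the job:

```java
// Sketch only: the exception is caught per record inside flatMap, so a bad
// record is logged and dropped rather than stopping the whole Spark job.
import java.util.Collections;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class PerRecordTryCatch {

    // One mapper per executor JVM; static fields are not serialized with the lambda.
    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static JavaRDD<JsonNode> parseLeniently(JavaRDD<String> rawJson) {
        FlatMapFunction<String, JsonNode> parseOne = value -> {
            try {
                return Collections.singletonList(MAPPER.readTree(value)).iterator();
            } catch (Exception e) {
                // The exception is bounded to this single record: log it and emit nothing.
                System.err.println("Dropping bad record: " + value);
                return Collections.<JsonNode>emptyIterator();
            }
        };
        return rawJson.flatMap(parseOne);
    }
}
```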
