
Kafka Connect S3 sink connector with custom Partitioner strange behavior


I plan to use a custom Field and TimeBased partitioner to partition my data in S3 as follows: /part_<field_name>=<field_value>/part_date=YYYY-MM-dd/part_hour=HH/....parquet.
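For context, here is roughly how I configure the connector (a sketch, not my exact config: the topic, bucket and field name are placeholders, the Parquet format class assumes a recent connector version, and partition.field is the custom key my partitioner reads, not a stock S3 sink option):

name=s3-sink
connector.class=io.confluent.connect.s3.S3SinkConnector
# placeholders
topics=my-topic
s3.bucket.name=my-bucket
storage.class=io.confluent.connect.s3.storage.S3Storage
format.class=io.confluent.connect.s3.format.parquet.ParquetFormat
partitioner.class=test.FieldAndTimeBasedPartitioner
partition.duration.ms=3600000
path.format='part_date'=YYYY-MM-dd/'part_hour'=HH
locale=en-US
timezone=UTC
timestamp.extractor=Record
# custom key read by the partitioner below
partition.field=country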

My Partitioner works fine, everything is as expected in my S3 bucket.

The problem is linked to the performance of the sink: I have 400 kB/s per broker, i.e. ~1.2 MB/s into my input topic, but the sink works in spikes and only commits a small number of records at a time.

If I use the classic TimeBasedPartitioner, the sink keeps up fine. [screenshot of the sink throughput omitted]

So my problem seems to be in my custom partitioner. Here is the code:

package test;
import java.util.Locale;
import java.util.Map;

import io.confluent.connect.storage.errors.PartitionException;
import io.confluent.connect.storage.partitioner.TimeBasedPartitioner;
import io.confluent.connect.storage.partitioner.TimestampExtractor;
import org.apache.kafka.common.config.ConfigException;
import org.apache.kafka.connect.connector.ConnectRecord;
import org.apache.kafka.connect.data.Struct;
import org.apache.kafka.connect.errors.ConnectException;
import org.apache.kafka.connect.sink.SinkRecord;
import org.joda.time.DateTime;
import org.joda.time.DateTimeZone;
import org.joda.time.format.DateTimeFormat;
import org.joda.time.format.DateTimeFormatter;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public final class FieldAndTimeBasedPartitioner<T> extends TimeBasedPartitioner<T> {

private static final Logger log = LoggerFactory.getLogger(FieldAndTimeBasedPartitioner.class);
private static final String FIELD_SUFFIX = "part_";
private static final String FIELD_SEP = "=";
private long partitionDurationMs;
private DateTimeFormatter formatter;
private TimestampExtractor timestampExtractor;
private PartitionFieldExtractor partitionFieldExtractor;
// Name of the record field used in the path; set from the "partition.field" config.
private String partitionFieldName;

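// Invoked by TimeBasedPartitioner.configure() with the parsed partitioner settings;
// wires up the path formatter, the timestamp extractor and the partition-field extractor.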
protected void init(long partitionDurationMs, String pathFormat, Locale locale, DateTimeZone timeZone, Map<String, Object> config) {

    this.delim = (String)config.get("directory.delim");
    this.partitionDurationMs = partitionDurationMs;

    try {
        this.formatter = getDateTimeFormatter(pathFormat, timeZone).withLocale(locale);
        this.timestampExtractor = this.newTimestampExtractor((String)config.get("timestamp.extractor"));
        this.timestampExtractor.configure(config);
        this.partitionFieldName = (String) config.get("partition.field");
        this.partitionFieldExtractor = new PartitionFieldExtractor(this.partitionFieldName);
    } catch (IllegalArgumentException e) {
        ConfigException ce = new ConfigException("path.format", pathFormat, e.getMessage());
        ce.initCause(e);
        throw ce;
    }
}

private static DateTimeFormatter getDateTimeFormatter(String str, DateTimeZone timeZone) {
    return DateTimeFormat.forPattern(str).withZone(timeZone);
}

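// Floors the timestamp to the start of its partition window (e.g. the current hour),
// converting to local time first so windows line up with the configured time zone.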
public static long getPartition(long timeGranularityMs, long timestamp, DateTimeZone timeZone) {
    long adjustedTimestamp = timeZone.convertUTCToLocal(timestamp);
    long partitionedTime = adjustedTimestamp / timeGranularityMs * timeGranularityMs;
    return timeZone.convertLocalToUTC(partitionedTime, false);
}

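// Overload that also receives the connector's current wall-clock time, for
// timestamp extractors that partition by wall clock rather than record time.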
public String encodePartition(SinkRecord sinkRecord, long nowInMillis) {
    final Long timestamp = this.timestampExtractor.extract(sinkRecord, nowInMillis);
    final String partitionField = this.partitionFieldExtractor.extract(sinkRecord);
    return this.encodedPartitionForFieldAndTime(sinkRecord, timestamp, partitionField);
}

public String encodePartition(SinkRecord sinkRecord) {
    final Long timestamp = this.timestampExtractor.extract(sinkRecord);
    final String partitionFieldValue = this.partitionFieldExtractor.extract(sinkRecord);
    return encodedPartitionForFieldAndTime(sinkRecord, timestamp, partitionFieldValue);
}

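// Builds "part_<field>=<value>/<formatted time path>", failing fast when either the
// timestamp or the partition field cannot be extracted from the record.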
private String encodedPartitionForFieldAndTime(SinkRecord sinkRecord, Long timestamp, String partitionField) {

    if (timestamp == null) {
        String msg = "Unable to determine timestamp using timestamp.extractor " + this.timestampExtractor.getClass().getName() + " for record: " + sinkRecord;
        log.error(msg);
        throw new ConnectException(msg);
    } else if (partitionField == null) {
        String msg = "Unable to determine partition field using partition.field '" + partitionField  + "' for record: " + sinkRecord;
        log.error(msg);
        throw new ConnectException(msg);
    }  else {
        DateTime recordTime = new DateTime(getPartition(this.partitionDurationMs, timestamp.longValue(), this.formatter.getZone()));
        return FIELD_SUFFIX
                + this.partitionFieldName
                + FIELD_SEP
                + partitionField
                + this.delim
                + recordTime.toString(this.formatter);
    }
}

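// Extracts the configured field's value from the record's Struct value.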
static class PartitionFieldExtractor {

    private final String fieldName;

    PartitionFieldExtractor(String fieldName) {
        this.fieldName = fieldName;
    }

    String extract(ConnectRecord<?> record) {
        Object value = record.value();
        if (value instanceof Struct) {
            Struct struct = (Struct)value;
            Object fieldValue = struct.get(fieldName);
            // String.valueOf would turn null into "null"; keep null so the caller can fail fast.
            return fieldValue == null ? null : String.valueOf(fieldValue);
        } else {
            FieldAndTimeBasedPartitioner.log.error("Record value is not a Struct; cannot extract partition field '{}'.", fieldName);
            throw new PartitionException("Error encoding partition.");
        }
    }
}

public long getPartitionDurationMs() {
    return partitionDurationMs;
}

public TimestampExtractor getTimestampExtractor() {
    return timestampExtractor;
}
}

It's more or less a merge of FieldPartitioner and TimeBasedPartitioner.
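To make the expected output concrete, here is a minimal sanity check of the partitioner (a sketch under assumptions: the topic, offset and country field are made up, and the config map carries already-parsed values, which is how the S3 sink hands them to partitioners):

import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.common.record.TimestampType;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaBuilder;
import org.apache.kafka.connect.data.Struct;
import org.apache.kafka.connect.sink.SinkRecord;
import test.FieldAndTimeBasedPartitioner;

public class PartitionerSanityCheck {

    public static void main(String[] args) {
        Map<String, Object> config = new HashMap<>();
        config.put("partition.duration.ms", 3600000L);   // hourly windows
        config.put("path.format", "'part_date'=YYYY-MM-dd/'part_hour'=HH");
        config.put("locale", "en-US");
        config.put("timezone", "UTC");
        config.put("timestamp.extractor", "Record");     // use the Kafka record timestamp
        config.put("directory.delim", "/");
        config.put("partition.field", "country");        // hypothetical field

        FieldAndTimeBasedPartitioner<String> partitioner = new FieldAndTimeBasedPartitioner<>();
        partitioner.configure(config);

        Schema valueSchema = SchemaBuilder.struct().field("country", Schema.STRING_SCHEMA).build();
        Struct value = new Struct(valueSchema).put("country", "FR");
        SinkRecord record = new SinkRecord("my-topic", 0, null, null, valueSchema, value,
                42L, 1546300800000L, TimestampType.CREATE_TIME);  // 2019-01-01T00:00:00Z

        // Prints: part_country=FR/part_date=2019-01-01/part_hour=00
        System.out.println(partitioner.encodePartition(record));
    }
}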

Any clue why I get such bad performance while sinking messages? Could deserializing the record and extracting the field value from each message cause this? And since I have around 80 distinct field values, could it be a memory issue, as the sink will maintain 80 times more buffers in the heap?

Thanks for your help.

FYI, the problem was the partitioner itself. My partitioner needed to decode the entire message to get the field value, and as I have a lot of messages, handling all these events takes time.
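A side note on the memory question above: as far as I understand it, the S3 sink keeps one open output file, with its own buffer, per encoded partition, so 80 field values multiply the number of concurrent buffers. The stock settings below bound how much is buffered and how often files are closed (values are illustrative, not a recommendation):

# Commit a file after this many records per partition.
flush.size=10000
# Close files whose record timestamps span more than this window (ms).
rotate.interval.ms=600000
# Also close files on wall-clock time, so quiet partitions still flush (ms).
rotate.schedule.interval.ms=600000
# In-memory buffer allocated per open partition file (bytes, 5 MB minimum).
s3.part.size=5242880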
