
Kafka Connect S3 sink connector with custom Partitioner strange behavior


I plan to use a custom field-and-time-based partitioner to partition my data in S3 as follows: /part_<field_name>=<field_value>/part_date=YYYY-MM-dd/part_hour=HH/....parquet.
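
For reference, a minimal sketch of the sink configuration that drives this partitioner (values are illustrative; partition.field is the custom key my partitioner reads, the rest are standard S3 sink connector properties):

connector.class=io.confluent.connect.s3.S3SinkConnector
storage.class=io.confluent.connect.s3.storage.S3Storage
format.class=io.confluent.connect.s3.format.parquet.ParquetFormat
s3.bucket.name=my-bucket
partitioner.class=test.FieldAndTimeBasedPartitioner
partition.field=my_field
partition.duration.ms=3600000
path.format='part_date'=YYYY-MM-dd/'part_hour'=HH
locale=en-US
timezone=UTC
timestamp.extractor=Record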

My partitioner works fine, and everything in my S3 bucket is laid out as expected.

The problem is the performance of the sink: I have 400 kB/s per broker (~1.2 MB/s in total) coming into my input topic, but the sink writes in bursts and commits only a small number of records at a time.

If I use the classic TimeBasedPartitioner, the sink keeps up with the input: [screenshot: sink throughput with TimeBasedPartitioner]

So my problem seems to be in my custom partitioner. Here is the code:

package test;
import io.confluent.connect.storage.errors.PartitionException;
import io.confluent.connect.storage.partitioner.TimeBasedPartitioner;
import io.confluent.connect.storage.partitioner.TimestampExtractor;
import java.util.Locale;
import java.util.Map;
import org.apache.kafka.common.config.ConfigException;
import org.apache.kafka.connect.connector.ConnectRecord;
import org.apache.kafka.connect.data.Struct;
import org.apache.kafka.connect.errors.ConnectException;
import org.apache.kafka.connect.sink.SinkRecord;
import org.joda.time.DateTime;
import org.joda.time.DateTimeZone;
import org.joda.time.format.DateTimeFormat;
import org.joda.time.format.DateTimeFormatter;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public final class FieldAndTimeBasedPartitioner<T> extends TimeBasedPartitioner<T> {

private static final Logger log = LoggerFactory.getLogger(FieldAndTimeBasedPartitioner.class);
private static final String FIELD_SUFFIX = "part_";
private static final String FIELD_SEP = "=";
private long partitionDurationMs;
private DateTimeFormatter formatter;
private TimestampExtractor timestampExtractor;
private PartitionFieldExtractor partitionFieldExtractor;

protected void init(long partitionDurationMs, String pathFormat, Locale locale, DateTimeZone timeZone, Map<String, Object> config) {

    this.delim = (String)config.get("directory.delim");
    this.partitionDurationMs = partitionDurationMs;

    try {
        this.formatter = getDateTimeFormatter(pathFormat, timeZone).withLocale(locale);
        this.timestampExtractor = this.newTimestampExtractor((String)config.get("timestamp.extractor"));
        this.timestampExtractor.configure(config);
        this.partitionFieldExtractor = new PartitionFieldExtractor((String)config.get("partition.field"));
    } catch (IllegalArgumentException e) {
        ConfigException ce = new ConfigException("path.format", pathFormat, e.getMessage());
        ce.initCause(e);
        throw ce;
    }
}

private static DateTimeFormatter getDateTimeFormatter(String str, DateTimeZone timeZone) {
    return DateTimeFormat.forPattern(str).withZone(timeZone);
}

public static long getPartition(long timeGranularityMs, long timestamp, DateTimeZone timeZone) {
    long adjustedTimestamp = timeZone.convertUTCToLocal(timestamp);
    long partitionedTime = adjustedTimestamp / timeGranularityMs * timeGranularityMs;
    return timeZone.convertLocalToUTC(partitionedTime, false);
}

public String encodePartition(SinkRecord sinkRecord, long nowInMillis) {
    final Long timestamp = this.timestampExtractor.extract(sinkRecord, nowInMillis);
    final String partitionField = this.partitionFieldExtractor.extract(sinkRecord);
    return this.encodedPartitionForFieldAndTime(sinkRecord, timestamp, partitionField);
}

public String encodePartition(SinkRecord sinkRecord) {
    final Long timestamp = this.timestampExtractor.extract(sinkRecord);
    final String partitionFieldValue = this.partitionFieldExtractor.extract(sinkRecord);
    return encodedPartitionForFieldAndTime(sinkRecord, timestamp, partitionFieldValue);
}

private String encodedPartitionForFieldAndTime(SinkRecord sinkRecord, Long timestamp, String partitionField) {

    if (timestamp == null) {
        String msg = "Unable to determine timestamp using timestamp.extractor " + this.timestampExtractor.getClass().getName() + " for record: " + sinkRecord;
        log.error(msg);
        throw new ConnectException(msg);
    } else if (partitionField == null) {
        // report the configured field name; partitionField itself is null in this branch
        String msg = "Unable to determine partition field using partition.field '" + this.partitionFieldExtractor.fieldName + "' for record: " + sinkRecord;
        log.error(msg);
        throw new ConnectException(msg);
    }  else {
        DateTime recordTime = new DateTime(getPartition(this.partitionDurationMs, timestamp.longValue(), this.formatter.getZone()));
        // builds part_<field_name>=<field_value>/<formatted time>
        return FIELD_SUFFIX
                + this.partitionFieldExtractor.fieldName // 'config' is only in scope in init(), so reuse the configured field name
                + FIELD_SEP
                + partitionField
                + this.delim
                + recordTime.toString(this.formatter);
    }
}

static class PartitionFieldExtractor {

    private final String fieldName;

    PartitionFieldExtractor(String fieldName) {
        this.fieldName = fieldName;
    }

    String extract(ConnectRecord<?> record) {
        Object value = record.value();
        if (value instanceof Struct) {
            Struct struct = (Struct)value;
            return (String) struct.get(fieldName);
        } else {
            FieldAndTimeBasedPartitioner.log.error("Record value is not a Struct; cannot extract partition field '" + fieldName + "'.");
            throw new PartitionException("Error encoding partition.");
        }
    }
}

public long getPartitionDurationMs() {
    return partitionDurationMs;
}

public TimestampExtractor getTimestampExtractor() {
    return timestampExtractor;
}
}

It's more or less a merge of FieldPartitioner and TimeBasedPartitioner.
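
To make the output concrete, here is a hypothetical example (the field name country, the topic name, and the timestamp below are invented), with partition.field=country and path.format='part_date'=YYYY-MM-dd/'part_hour'=HH:

// assumes org.apache.kafka.connect.data.{Schema,SchemaBuilder,Struct},
// org.apache.kafka.connect.sink.SinkRecord, org.apache.kafka.common.record.TimestampType
Schema schema = SchemaBuilder.struct()
        .field("country", Schema.STRING_SCHEMA)
        .build();
Struct value = new Struct(schema).put("country", "FR");
SinkRecord record = new SinkRecord("my-topic", 0, null, null, schema, value,
        42L, 1551435300000L, TimestampType.CREATE_TIME); // 2019-03-01T10:15:00Z
// encodePartition(record) should return something like:
//   part_country=FR/part_date=2019-03-01/part_hour=10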

Any clue on why I get such bad performance while sinking messages? Could deserializing each message and extracting the partition field from it cause this issue? And since I have around 80 distinct field values, could it be a memory issue, as the connector will maintain 80 times more buffers in the heap?
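
For the memory angle, a back-of-envelope estimate (assuming, as I understand it, that the S3 sink keeps one multipart-upload buffer of s3.part.size bytes per open partition file):

long s3PartSizeBytes = 25L * 1024 * 1024; // default s3.part.size (26214400 bytes, ~25 MB)
int openPartitions = 80;                  // one open file per distinct field value
long bufferHeapBytes = s3PartSizeBytes * openPartitions;
System.out.println(bufferHeapBytes / (1024 * 1024) + " MB of heap for write buffers"); // 2000 MB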

Thanks for your help.

FYI, the problem was the partitioner itself. My partitioner needed to decode the entire message to get the field value, and since I have a lot of messages, handling all these events takes time.
