
Kafka Connect S3 sink connector with custom Partitioner strange behavior


I plan to use a custom field-and-time-based partitioner to partition my data in S3 as follows: /part_<field_name>=<field_value>/part_date=YYYY-MM-dd/part_hour=HH/....parquet.
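
For reference, a minimal sketch of the sink configuration that drives this partitioner (values are illustrative; partition.field is the custom key my partitioner reads, the rest are standard S3 sink connector properties):

connector.class=io.confluent.connect.s3.S3SinkConnector
storage.class=io.confluent.connect.s3.storage.S3Storage
format.class=io.confluent.connect.s3.format.parquet.ParquetFormat
s3.bucket.name=my-bucket
partitioner.class=test.FieldAndTimeBasedPartitioner
partition.field=my_field
partition.duration.ms=3600000
path.format='part_date'=YYYY-MM-dd/'part_hour'=HH
locale=en-US
timezone=UTC
timestamp.extractor=Record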

My partitioner works fine, and everything in my S3 bucket is laid out as expected.

The problem is the performance of the sink: I have 400 kB/s per broker (~1.2 MB/s in total) coming into my input topic, but the sink writes in bursts and commits only a small number of records at a time.

If I use the classic TimeBasedPartitioner, the sink keeps up with the input: [screenshot: sink throughput with TimeBasedPartitioner]

So my problem seems to be in my custom partitioner. Here is the code:

package test;
import io.confluent.connect.storage.errors.PartitionException;
import io.confluent.connect.storage.partitioner.TimeBasedPartitioner;
import io.confluent.connect.storage.partitioner.TimestampExtractor;
import java.util.Locale;
import java.util.Map;
import org.apache.kafka.common.config.ConfigException;
import org.apache.kafka.connect.connector.ConnectRecord;
import org.apache.kafka.connect.data.Struct;
import org.apache.kafka.connect.errors.ConnectException;
import org.apache.kafka.connect.sink.SinkRecord;
import org.joda.time.DateTime;
import org.joda.time.DateTimeZone;
import org.joda.time.format.DateTimeFormat;
import org.joda.time.format.DateTimeFormatter;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public final class FieldAndTimeBasedPartitioner<T> extends TimeBasedPartitioner<T> {

private static final Logger log = LoggerFactory.getLogger(FieldAndTimeBasedPartitioner.class);
private static final String FIELD_SUFFIX = "part_";
private static final String FIELD_SEP = "=";
private long partitionDurationMs;
private DateTimeFormatter formatter;
private TimestampExtractor timestampExtractor;
private PartitionFieldExtractor partitionFieldExtractor;

protected void init(long partitionDurationMs, String pathFormat, Locale locale, DateTimeZone timeZone, Map<String, Object> config) {

    this.delim = (String)config.get("directory.delim");
    this.partitionDurationMs = partitionDurationMs;

    try {
        this.formatter = getDateTimeFormatter(pathFormat, timeZone).withLocale(locale);
        this.timestampExtractor = this.newTimestampExtractor((String)config.get("timestamp.extractor"));
        this.timestampExtractor.configure(config);
        this.partitionFieldExtractor = new PartitionFieldExtractor((String)config.get("partition.field"));
    } catch (IllegalArgumentException e) {
        ConfigException ce = new ConfigException("path.format", pathFormat, e.getMessage());
        ce.initCause(e);
        throw ce;
    }
}

private static DateTimeFormatter getDateTimeFormatter(String str, DateTimeZone timeZone) {
    return DateTimeFormat.forPattern(str).withZone(timeZone);
}

public static long getPartition(long timeGranularityMs, long timestamp, DateTimeZone timeZone) {
    long adjustedTimestamp = timeZone.convertUTCToLocal(timestamp);
    long partitionedTime = adjustedTimestamp / timeGranularityMs * timeGranularityMs;
    return timeZone.convertLocalToUTC(partitionedTime, false);
}

public String encodePartition(SinkRecord sinkRecord, long nowInMillis) {
    final Long timestamp = this.timestampExtractor.extract(sinkRecord, nowInMillis);
    final String partitionField = this.partitionFieldExtractor.extract(sinkRecord);
    return this.encodedPartitionForFieldAndTime(sinkRecord, timestamp, partitionField);
}

public String encodePartition(SinkRecord sinkRecord) {
    final Long timestamp = this.timestampExtractor.extract(sinkRecord);
    final String partitionFieldValue = this.partitionFieldExtractor.extract(sinkRecord);
    return encodedPartitionForFieldAndTime(sinkRecord, timestamp, partitionFieldValue);
}

private String encodedPartitionForFieldAndTime(SinkRecord sinkRecord, Long timestamp, String partitionField) {

    if (timestamp == null) {
        String msg = "Unable to determine timestamp using timestamp.extractor " + this.timestampExtractor.getClass().getName() + " for record: " + sinkRecord;
        log.error(msg);
        throw new ConnectException(msg);
    } else if (partitionField == null) {
        // report the configured field name; partitionField itself is null in this branch
        String msg = "Unable to determine partition field using partition.field '" + this.partitionFieldExtractor.fieldName + "' for record: " + sinkRecord;
        log.error(msg);
        throw new ConnectException(msg);
    }  else {
        DateTime recordTime = new DateTime(getPartition(this.partitionDurationMs, timestamp.longValue(), this.formatter.getZone()));
        // builds part_<field_name>=<field_value>/<formatted time>
        return FIELD_SUFFIX
                + this.partitionFieldExtractor.fieldName // 'config' is only in scope in init(), so reuse the configured field name
                + FIELD_SEP
                + partitionField
                + this.delim
                + recordTime.toString(this.formatter);
    }
}

static class PartitionFieldExtractor {

    private final String fieldName;

    PartitionFieldExtractor(String fieldName) {
        this.fieldName = fieldName;
    }

    String extract(ConnectRecord<?> record) {
        Object value = record.value();
        if (value instanceof Struct) {
            Struct struct = (Struct)value;
            return (String) struct.get(fieldName);
        } else {
            FieldAndTimeBasedPartitioner.log.error("Record value is not a Struct; cannot extract partition field '" + fieldName + "'.");
            throw new PartitionException("Error encoding partition.");
        }
    }
}

public long getPartitionDurationMs() {
    return partitionDurationMs;
}

public TimestampExtractor getTimestampExtractor() {
    return timestampExtractor;
}
}

It's more or less a merge of FieldPartitioner and TimeBasedPartitioner.
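
To make the output concrete, here is a hypothetical example (the field name country, the topic name, and the timestamp below are invented), with partition.field=country and path.format='part_date'=YYYY-MM-dd/'part_hour'=HH:

// assumes org.apache.kafka.connect.data.{Schema,SchemaBuilder,Struct},
// org.apache.kafka.connect.sink.SinkRecord, org.apache.kafka.common.record.TimestampType
Schema schema = SchemaBuilder.struct()
        .field("country", Schema.STRING_SCHEMA)
        .build();
Struct value = new Struct(schema).put("country", "FR");
SinkRecord record = new SinkRecord("my-topic", 0, null, null, schema, value,
        42L, 1551435300000L, TimestampType.CREATE_TIME); // 2019-03-01T10:15:00Z
// encodePartition(record) should return something like:
//   part_country=FR/part_date=2019-03-01/part_hour=10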

Any clue on why I get such bad performance while sinking messages? Could deserializing each message and extracting the partition field from it cause this issue? And since I have around 80 distinct field values, could it be a memory issue, as the connector will maintain 80 times more buffers in the heap?
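
For the memory angle, a back-of-envelope estimate (assuming, as I understand it, that the S3 sink keeps one multipart-upload buffer of s3.part.size bytes per open partition file):

long s3PartSizeBytes = 25L * 1024 * 1024; // default s3.part.size (26214400 bytes, ~25 MB)
int openPartitions = 80;                  // one open file per distinct field value
long bufferHeapBytes = s3PartSizeBytes * openPartitions;
System.out.println(bufferHeapBytes / (1024 * 1024) + " MB of heap for write buffers"); // 2000 MB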

Thanks for your help.

FYI, the problem was the partitioner itself. My partitioner needed to decode the entire message to get the field value, and since I have a lot of messages, handling all these events takes time.
