ExtractField and Parse JSON in kafka-connect sink

I have a kafka-connect flow of mongodb->kafka connect->elasticsearch sending data end to end OK, but the payload document is JSON encoded. Here's my source mongodb document.

{
  "_id": "1541527535911",
  "enabled": true,
  "price": 15.99,
  "style": {
    "color": "blue"
  },
  "tags": [
    "shirt",
    "summer"
  ]
}

And here's my mongodb source connector configuration:

{
  "name": "redacted",
  "config": {
    "connector.class": "com.teambition.kafka.connect.mongo.source.MongoSourceConnector",
    "databases": "redacted.redacted",
    "initial.import": "true",
    "topic.prefix": "redacted",
    "tasks.max": "8",
    "batch.size": "1",
    "key.serializer": "org.apache.kafka.common.serialization.StringSerializer",
    "value.serializer": "org.apache.kafka.common.serialization.JSONSerializer",
    "key.serializer.schemas.enable": false,
    "value.serializer.schemas.enable": false,
    "compression.type": "none",
    "mongo.uri": "mongodb://redacted:27017/redacted",
    "analyze.schema": false,
    "schema.name": "__unused__",
    "transforms": "RenameTopic",
    "transforms.RenameTopic.type":
      "org.apache.kafka.connect.transforms.RegexRouter",
    "transforms.RenameTopic.regex": "redacted.redacted_Redacted",
    "transforms.RenameTopic.replacement": "redacted"
  }
}

Over in elasticsearch, it ends up looking like this:

{
  "_index" : "redacted",
  "_type" : "kafka-connect",
  "_id" : "{\"schema\":{\"type\":\"string\",\"optional\":true},\"payload\":\"1541527535911\"}",
  "_score" : 1.0,
  "_source" : {
    "ts" : 1541527536,
    "inc" : 2,
    "id" : "1541527535911",
    "database" : "redacted",
    "op" : "i",
    "object" : "{ \"_id\" : \"1541527535911\", \"price\" : 15.99,
      \"enabled\" : true, \"tags\" : [\"shirt\", \"summer\"],
      \"style\" : { \"color\" : \"blue\" } }"
  }
}

I'd like to use 2 single message transforms:

  1. ExtractField to grab object, which is a string of JSON
  2. Something to parse that JSON into an object, or just let the normal JSONConverter handle it, as long as it ends up properly structured in elasticsearch.

I've attempted to do it with just ExtractField in my sink config, but I see this error logged by kafka:

kafka-connect_1       | org.apache.kafka.connect.errors.ConnectException:
Bulk request failed: [{"type":"mapper_parsing_exception",
"reason":"failed to parse", 
"caused_by":{"type":"not_x_content_exception",
"reason":"Compressor detection can only be called on some xcontent bytes or
compressed xcontent bytes"}}]
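
As far as I can tell, with only ExtractField the value handed to the sink is the raw JSON string from object, i.e. something like this, which elasticsearch can't parse as a structured document:

"{ \"_id\" : \"1541527535911\", \"price\" : 15.99, \"enabled\" : true, ... }"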

Here's my elasticsearch sink connector configuration. In this version, I have things working, but I had to code a custom ParseJson SMT. It's working well, but if there's a better way to do this with some combination of built-in stuff (converters, SMTs, whatever works), I'd love to see that.

{
  "name": "redacted",
  "config": {
    "connector.class":
      "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "batch.size": 1,
    "connection.url": "http://redacted:9200",
    "key.converter.schemas.enable": true,
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "schema.ignore": true,
    "tasks.max": "1",
    "topics": "redacted",
    "transforms": "ExtractFieldPayload,ExtractFieldObject,ParseJson,ReplaceId",
    "transforms.ExtractFieldPayload.type": "org.apache.kafka.connect.transforms.ExtractField$Value",
    "transforms.ExtractFieldPayload.field": "payload",
    "transforms.ExtractFieldObject.type": "org.apache.kafka.connect.transforms.ExtractField$Value",
    "transforms.ExtractFieldObject.field": "object",
    "transforms.ParseJson.type": "reaction.kafka.connect.transforms.ParseJson",
    "transforms.ReplaceId.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
    "transforms.ReplaceId.renames": "_id:id",
    "type.name": "kafka-connect",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": false
  }
}
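
After that chain runs (ExtractFieldPayload, then ExtractFieldObject, then ParseJson, then ReplaceId), the record value should end up structured like the original mongodb document, with _id renamed to id by ReplaceId. A sketch of the expected document in elasticsearch:

{
  "id": "1541527535911",
  "enabled": true,
  "price": 15.99,
  "style": {
    "color": "blue"
  },
  "tags": [
    "shirt",
    "summer"
  ]
}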

I am not sure about your Mongo connector. I don't recognize the class or the configurations... Most people probably use the Debezium Mongo connector.

I would set it up this way, though:

"connector.class": "com.teambition.kafka.connect.mongo.source.MongoSourceConnector",

"key.serializer": "org.apache.kafka.common.serialization.StringSerializer",
"value.serializer": "org.apache.kafka.common.serialization.JSONSerializer",
"key.serializer.schemas.enable": false,
"value.serializer.schemas.enable": true,

The schemas.enable setting is important; that way, the internal Connect data classes know how to convert to/from other formats.
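
With schemas enabled, each record value on the topic is wrapped in the standard Connect JSON envelope. A sketch based on the sample document (the exact schema fields depend on what the connector emits):

{
  "schema": {
    "type": "struct",
    "fields": [
      { "field": "_id", "type": "string", "optional": false },
      { "field": "enabled", "type": "boolean", "optional": true },
      { "field": "price", "type": "double", "optional": true }
    ],
    "optional": false
  },
  "payload": {
    "_id": "1541527535911",
    "enabled": true,
    "price": 15.99
  }
}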

Then, in the Sink, you again need to use the JSON deserializer (via the converter) so that it creates a full object rather than a plaintext string, as you can see in Elasticsearch ( {\"schema\":{\"type\":\"string\" ).

"connector.class":
  "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",

"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"key.converter.schemas.enable": false,
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable": true

And if this doesn't work, then you might have to manually create your index mapping in Elasticsearch ahead of time so it knows how to actually parse the strings you are sending it.
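
For example, a hypothetical mapping for the sample document (the field types are guesses based on the sample data, and the mapping type matches the type.name from the sink config; adjust for your own data):

PUT /redacted
{
  "mappings": {
    "kafka-connect": {
      "properties": {
        "enabled": { "type": "boolean" },
        "price": { "type": "float" },
        "style": { "properties": { "color": { "type": "keyword" } } },
        "tags": { "type": "keyword" }
      }
    }
  }
}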
