Unable to convert Kafka topic data into structured JSON with Confluent Elasticsearch sink connector
I'm building a data pipeline using Kafka. The data flow is as follows: capture data changes in MongoDB and send them to Elasticsearch.
MongoDB -> Kafka -> Elasticsearch
Since I'm still testing, all Kafka-related systems are running on a single server.
start zookeeper
$ bin/zookeeper-server-start etc/kafka/zookeeper.properties
start kafka broker (bootstrap server)
$ bin/kafka-server-start etc/kafka/server.properties
start schema registry
$ bin/schema-registry-start etc/schema-registry/schema-registry.properties
start mongodb source connector
$ bin/connect-standalone \
    etc/schema-registry/connect-avro-standalone.properties \
    etc/kafka/connect-mongo-source.properties

$ cat etc/kafka/connect-mongo-source.properties
>>>
name=mongodb-source-connector
connector.class=io.debezium.connector.mongodb.MongoDbConnector
mongodb.hosts=''
initial.sync.max.threads=1
tasks.max=1
mongodb.name=higee

$ cat etc/schema-registry/connect-avro-standalone.properties
>>>
bootstrap.servers=localhost:9092
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
internal.key.converter=org.apache.kafka.connect.json.JsonConverter
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter.schemas.enable=false
internal.value.converter.schemas.enable=false
rest.port=8083
start elasticsearch sink connector
$ bin/connect-standalone \
    etc/schema-registry/connect-avro-standalone2.properties \
    etc/kafka-connect-elasticsearch/elasticsearch.properties

$ cat etc/kafka-connect-elasticsearch/elasticsearch.properties
>>>
name=elasticsearch-sink
connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
tasks.max=1
topics=higee.higee.higee
key.ignore=true
connection.url=''
type.name=kafka-connect

$ cat etc/schema-registry/connect-avro-standalone2.properties
>>>
bootstrap.servers=localhost:9092
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
internal.key.converter=org.apache.kafka.connect.json.JsonConverter
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter.schemas.enable=false
internal.value.converter.schemas.enable=false
rest.port=8084
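Since each standalone worker exposes its own REST port (8083 for the source worker, 8084 for the sink worker, as configured above), one quick sanity check is to list the connectors each worker has registered. A minimal sketch, assuming Python 3 and that both workers run on localhost:

# List the connectors registered on each standalone Connect worker.
# Ports are taken from the rest.port settings above; the hostname is assumed.
import json
import urllib.request

for port in (8083, 8084):
    with urllib.request.urlopen("http://localhost:%d/connectors" % port) as resp:
        print(port, json.loads(resp.read().decode("utf-8")))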
Everything is fine with the above setup. The Kafka connector captures data changes (CDC) and successfully sends them to Elasticsearch via the sink connector. The problem is that I cannot convert the string-typed message data into a structured data type. For instance, let's consume the topic data after making some changes to MongoDB.
$ bin/kafka-avro-console-consumer \
--bootstrap-server localhost:9092 \
--topic higee.higee.higee --from-beginning | jq
Then, I get the following result.
"after": null,
"patch": {
"string": "{\"_id\" : {\"$oid\" : \"5ad97f982a0f383bb638ecac\"},\"name\" : \"higee\",\"salary\" : 100,\"origin\" : \"South Korea\"}"
},
"source": {
"version": {
"string": "0.7.5"
},
"name": "higee",
"rs": "172.31.50.13",
"ns": "higee",
"sec": 1524214412,
"ord": 1,
"h": {
"long": -2379508538412995600
},
"initsync": {
"boolean": false
}
},
"op": {
"string": "u"
},
"ts_ms": {
"long": 1524214412159
}
}
Then, if I go to Elasticsearch, I get the following result.
{
"_index": "higee.higee.higee",
"_type": "kafka-connect",
"_id": "higee.higee.higee+0+3",
"_score": 1,
"_source": {
"after": null,
"patch": """{"_id" : {"$oid" : "5ad97f982a0f383bb638ecac"},
"name" : "higee",
"salary" : 100,
"origin" : "South Korea"}""",
"source": {
"version": "0.7.5",
"name": "higee",
"rs": "172.31.50.13",
"ns": "higee",
"sec": 1524214412,
"ord": 1,
"h": -2379508538412995600,
"initsync": false
},
"op": "u",
"ts_ms": 1524214412159
}
}
What I want to achieve is something like the following:
{
"_index": "higee.higee.higee",
"_type": "kafka-connect",
"_id": "higee.higee.higee+0+3",
"_score": 1,
"_source": {
"oid" : "5ad97f982a0f383bb638ecac",
"name" : "higee",
"salary" : 100,
"origin" : "South Korea"
}"
}
Some of the options I've been trying, and am still considering, are as follows.
Logstash
case 1: don't know how to parse those characters (\u0002, \u0001)
logstash.conf
input {
  kafka {
    bootstrap_servers => ["localhost:9092"]
    topics => ["higee.higee.higee"]
    auto_offset_reset => "earliest"
    codec => json {
      charset => "UTF-8"
    }
  }
}

filter {
  json {
    source => "message"
  }
}

output {
  stdout {
    codec => rubydebug
  }
}
result
{ "message" => "H\ \{\\"_id\\" : \\ {\\"$oid\\" : \\"5adafc0e2a0f383bb63910a6\\"}, \\ \\"name\\" : \\"higee\\", \\ \\"salary\\" : 101, \\ \\"origin\\" : \\"South Korea\\"} \\ \\\n0.7.5\\nhigee \\ \172.31.50.13\higee.higee2 \\ ح\\v\\ ̗ \\u\ X", "tags" => [[0] "_jsonparsefailure"] }
case 2
logstash.conf
input {
  kafka {
    bootstrap_servers => ["localhost:9092"]
    topics => ["higee.higee.higee"]
    auto_offset_reset => "earliest"
    codec => avro {
      schema_uri => "./test.avsc"
    }
  }
}

filter {
  json {
    source => "message"
  }
}

output {
  stdout {
    codec => rubydebug
  }
}
test.avsc
{ "namespace": "example", "type": "record", "name": "Higee", "fields": [ {"name": "_id", "type": "string"}, {"name": "name", "type": "string"}, {"name": "salary", "type": "int"}, {"name": "origin", "type": "string"} ] }
result
An unexpected error occurred! {:error=>#<NoMethodError: undefined method `type_sym' for nil:NilClass>, :backtrace=>
["/home/ec2-user/logstash-6.1.0/vendor/bundle/jruby/2.3.0/gems/avro-1.8.2/lib/avro/io.rb:224:in `match_schemas'",
 "/home/ec2-user/logstash-6.1.0/vendor/bundle/jruby/2.3.0/gems/avro-1.8.2/lib/avro/io.rb:280:in `read_data'",
 "/home/ec2-user/logstash-6.1.0/vendor/bundle/jruby/2.3.0/gems/avro-1.8.2/lib/avro/io.rb:376:in `read_union'",
 "/home/ec2-user/logstash-6.1.0/vendor/bundle/jruby/2.3.0/gems/avro-1.8.2/lib/avro/io.rb:309:in `read_data'",
 "/home/ec2-user/logstash-6.1.0/vendor/bundle/jruby/2.3.0/gems/avro-1.8.2/lib/avro/io.rb:384:in `block in read_record'",
 "org/jruby/RubyArray.java:1734:in `each'",
 "/home/ec2-user/logstash-6.1.0/vendor/bundle/jruby/2.3.0/gems/avro-1.8.2/lib/avro/io.rb:382:in `read_record'",
 "/home/ec2-user/logstash-6.1.0/vendor/bundle/jruby/2.3.0/gems/avro-1.8.2/lib/avro/io.rb:310:in `read_data'",
 "/home/ec2-user/logstash-6.1.0/vendor/bundle/jruby/2.3.0/gems/avro-1.8.2/lib/avro/io.rb:275:in `read'",
 "/home/ec2-user/logstash-6.1.0/vendor/bundle/jruby/2.3.0/gems/logstash-codec-avro-3.2.3-java/lib/logstash/codecs/avro.rb:77:in `decode'",
 "/home/ec2-user/logstash-6.1.0/vendor/bundle/jruby/2.3.0/gems/logstash-input-kafka-8.0.2/lib/logstash/inputs/kafka.rb:254:in `block in thread_runner'",
 "/home/ec2-user/logstash-6.1.0/vendor/bundle/jruby/2.3.0/gems/logstash-input-kafka-8.0.2/lib/logstash/inputs/kafka.rb:253:in `block in thread_runner'"]}
python client
kafka library: wasn't able to decode the message
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    topics='higee.higee.higee',
    auto_offset_reset='earliest'
)

for message in consumer:
    message.value.decode('utf-8')

>>> 'utf-8' codec can't decode byte 0xe4 in position 6: invalid continuation byte
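(For context: the topic values are Confluent-framed Avro rather than plain UTF-8 text. Byte 0 is a magic byte and bytes 1-4 are the schema-registry schema id, followed by the Avro payload, which is why a plain decode fails. A small sketch, assuming kafka-python and a local broker address, that shows the framing:)

# Peek at the Confluent wire format: 1 magic byte + 4-byte schema id + Avro payload.
# bootstrap_servers is a placeholder for the actual broker address.
import struct
from kafka import KafkaConsumer

consumer = KafkaConsumer('higee.higee.higee',
                         bootstrap_servers='localhost:9092',
                         auto_offset_reset='earliest')
for message in consumer:
    magic, schema_id = struct.unpack('>bI', message.value[:5])
    print('magic byte:', magic, 'schema id:', schema_id)
    break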
confluent_kafka library: wasn't compatible with Python 3
Any idea how I can jsonify the data in Elasticsearch? Following are the sources I searched.

Thanks in advance.
Some attempts
1) I changed my connect-mongo-source.properties file as follows to test the transformation.
$ cat etc/kafka/connect-mongo-source.properties
>>>
name=mongodb-source-connector
connector.class=io.debezium.connector.mongodb.MongoDbConnector
mongodb.hosts=''
initial.sync.max.threads=1
tasks.max=1
mongodb.name=higee
transforms=unwrap
transforms.unwrap.type=io.debezium.connector.mongodb.transforms.UnwrapFromMongoDbEnvelope
And following is the error log I got. Not yet being comfortable with Kafka and, more importantly, the Debezium platform, I wasn't able to debug this error.
ERROR WorkerSourceTask{id=mongodb-source-connector-0} Task threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask:172)
org.bson.json.JsonParseException: JSON reader expected a string but found '0'.
at org.bson.json.JsonReader.visitBinDataExtendedJson(JsonReader.java:904)
at org.bson.json.JsonReader.visitExtendedJSON(JsonReader.java:570)
at org.bson.json.JsonReader.readBsonType(JsonReader.java:145)
at org.bson.codecs.BsonDocumentCodec.decode(BsonDocumentCodec.java:82)
at org.bson.codecs.BsonDocumentCodec.decode(BsonDocumentCodec.java:41)
at org.bson.codecs.BsonDocumentCodec.readValue(BsonDocumentCodec.java:101)
at org.bson.codecs.BsonDocumentCodec.decode(BsonDocumentCodec.java:84)
at org.bson.BsonDocument.parse(BsonDocument.java:62)
at io.debezium.connector.mongodb.transforms.UnwrapFromMongoDbEnvelope.apply(UnwrapFromMongoDbEnvelope.java:45)
at org.apache.kafka.connect.runtime.TransformationChain.apply(TransformationChain.java:38)
at org.apache.kafka.connect.runtime.WorkerSourceTask.sendRecords(WorkerSourceTask.java:218)
at org.apache.kafka.connect.runtime.WorkerSourceTask.execute(WorkerSourceTask.java:194)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:170)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:214)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2) This time, I changed elasticsearch.properties and didn't make a change to connect-mongo-source.properties.
$ cat connect-mongo-source.properties
name=mongodb-source-connector
connector.class=io.debezium.connector.mongodb.MongoDbConnector
mongodb.hosts=''
initial.sync.max.threads=1
tasks.max=1
mongodb.name=higee
$ cat elasticsearch.properties
name=elasticsearch-sink
connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
tasks.max=1
topics=higee.higee.higee
key.ignore=true
connection.url=''
type.name=kafka-connect
transforms=unwrap
transforms.unwrap.type=io.debezium.connector.mongodb.transforms.UnwrapFromMongoDbEnvelope
And I got the following error.
ERROR WorkerSinkTask{id=elasticsearch-sink-0} Task threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask:172)
org.bson.BsonInvalidOperationException: Document does not contain key $set
at org.bson.BsonDocument.throwIfKeyAbsent(BsonDocument.java:844)
at org.bson.BsonDocument.getDocument(BsonDocument.java:135)
at io.debezium.connector.mongodb.transforms.UnwrapFromMongoDbEnvelope.apply(UnwrapFromMongoDbEnvelope.java:53)
at org.apache.kafka.connect.runtime.TransformationChain.apply(TransformationChain.java:38)
at org.apache.kafka.connect.runtime.WorkerSinkTask.convertMessages(WorkerSinkTask.java:480)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:301)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:205)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:173)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:170)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:214)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
3) Changed test.avsc and ran Logstash. I didn't get any error message, but the outcome wasn't what I was expecting, in that the origin, salary, and name fields were all empty even though they were given non-null values. I was even able to read the data properly through the console consumer.
$ cat test.avsc
>>>
{
"type" : "record",
"name" : "MongoEvent",
"namespace" : "higee.higee",
"fields" : [ {
"name" : "_id",
"type" : {
"type" : "record",
"name" : "HigeeEvent",
"fields" : [ {
"name" : "$oid",
"type" : "string"
}, {
"name" : "salary",
"type" : "long"
}, {
"name" : "origin",
"type" : "string"
}, {
"name" : "name",
"type" : "string"
} ]
}
} ]
}
$ cat logstash3.conf
>>>
input {
kafka {
bootstrap_servers => ["localhost:9092"]
topics => ["higee.higee.higee"]
auto_offset_reset => "earliest"
codec => avro {
schema_uri => "./test.avsc"
}
}
}
output {
stdout {
codec => rubydebug
}
}
$ bin/logstash -f logstash3.conf
>>>
{
"@version" => "1",
"_id" => {
"salary" => 0,
"origin" => "",
"$oid" => "",
"name" => ""
},
"@timestamp" => 2018-04-25T09:39:07.962Z
}
You must use the Avro Consumer, otherwise you will get 'utf-8' codec can't decode byte.
Even this example will not work because you still need the Schema Registry to look up the schema.
The prerequisites of Confluent's Python client say it works with Python 3.x.
Nothing is stopping you from using a different client, so not sure why you left it at only trying Python.
Regarding $oid in place of _id: your AVSC should actually look like this:
{
"type" : "record",
"name" : "MongoEvent",
"namespace" : "higee.higee",
"fields" : [ {
"name" : "_id",
"type" : {
"type" : "record",
"name" : "HigeeEvent",
"fields" : [ {
"name" : "$oid",
"type" : "string"
}, {
"name" : "salary",
"type" : "long"
}, {
"name" : "origin",
"type" : "string"
}, {
"name" : "name",
"type" : "string"
} ]
}
} ]
}
However, Avro doesn't allow for names starting with anything but a regex of [A-Za-z_], so that $oid would be a problem.
While I would not recommend it (nor have I actually tried it), one possible way to get your JSON-encoded Avro data into Logstash from the Avro console consumer could be to use the Pipe input plugin:
input {
pipe {
codec => json
command => "/path/to/confluent/bin/kafka-avro-console-consumer --bootstrap-server localhost:9092 --topic higee.higee.higee --from-beginning"
}
}
Note that the after value is always a string, and that by convention it will contain a JSON representation of the document: http://debezium.io/docs/connectors/mongodb/
I think this also applies to patch values, but I don't know Debezium, really.
Kafka won't parse the JSON in flight without the use of a Simple Message Transform (SMT). Reading the documentation you linked to, you should probably add these to your Connect source properties:
transforms=unwrap
transforms.unwrap.type=io.debezium.connector.mongodb.transforms.UnwrapFromMongoDbEnvelope
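(If you later run the source connector in distributed mode rather than standalone, the same config, including the transform, could be registered through the Connect REST API instead of a properties file. A rough sketch, with the worker URL and mongodb.hosts left as placeholders:)

# Sketch only: register the Debezium MongoDB source connector (with the unwrap SMT)
# via the Connect REST API. The worker URL and mongodb.hosts are placeholders.
import json
import urllib.request

connector = {
    "name": "mongodb-source-connector",
    "config": {
        "connector.class": "io.debezium.connector.mongodb.MongoDbConnector",
        "mongodb.hosts": "rs0/localhost:27017",
        "mongodb.name": "higee",
        "initial.sync.max.threads": "1",
        "tasks.max": "1",
        "transforms": "unwrap",
        "transforms.unwrap.type": "io.debezium.connector.mongodb.transforms.UnwrapFromMongoDbEnvelope"
    }
}
req = urllib.request.Request(
    "http://localhost:8083/connectors",
    data=json.dumps(connector).encode("utf-8"),
    headers={"Content-Type": "application/json"}
)
print(urllib.request.urlopen(req).read().decode("utf-8"))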
Also worth pointing out, field flattening is on the roadmap - DBZ-561.
Elasticsearch doesn't parse and process encoded JSON string objects without the use of something like Logstash or its JSON Processor. Rather, it only indexes them as one whole string body.
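(For the JSON Processor route, depending on your Elasticsearch version, an ingest pipeline can expand the string-valued patch field into an object; documents then have to be indexed through that pipeline, e.g. via the ?pipeline query parameter or, on newer versions, the index's default_pipeline setting. A sketch with assumed host, pipeline, and target field names:)

# Sketch: create an ingest pipeline whose "json" processor parses the "patch"
# string into a structured "patch_json" object. Names here are illustrative.
import json
import urllib.request

pipeline = {
    "description": "parse the Debezium patch string into an object",
    "processors": [
        {"json": {"field": "patch", "target_field": "patch_json"}}
    ]
}
req = urllib.request.Request(
    "http://localhost:9200/_ingest/pipeline/parse-patch",
    data=json.dumps(pipeline).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="PUT"
)
print(urllib.request.urlopen(req).read().decode("utf-8"))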
If I recall correctly, Connect will only apply an Elasticsearch mapping to top-level Avro fields, not nested ones.
In other words, the mapping that is generated follows this pattern:
"patch": {
"string": "...some JSON object string here..."
},
where you actually need it to be like this - perhaps by manually defining your ES index mapping:
"patch": {
"properties": {
"_id": {
"properties" {
"$oid" : { "type": "text" },
"name" : { "type": "text" },
"salary": { "type": "int" },
"origin": { "type": "text" }
},
Again, not sure if the dollar sign is allowed there, though.
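(One way to experiment with that is to pre-create the index with an explicit mapping before the sink connector writes to it. A rough sketch, assuming a local Elasticsearch, the index and type names from the question, and nesting name, salary, and origin directly under patch to mirror the document shown earlier; the field types and the "$oid" field name are assumptions, not verified:)

# Sketch: pre-create the index with an explicit mapping so "patch" is not indexed
# as one long string. Host, field types, and the "$oid" field name are assumptions.
import json
import urllib.request

mapping = {
    "mappings": {
        "kafka-connect": {
            "properties": {
                "patch": {
                    "properties": {
                        "_id": {"properties": {"$oid": {"type": "text"}}},
                        "name": {"type": "text"},
                        "salary": {"type": "integer"},
                        "origin": {"type": "text"}
                    }
                }
            }
        }
    }
}
req = urllib.request.Request(
    "http://localhost:9200/higee.higee.higee",
    data=json.dumps(mapping).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="PUT"
)
print(urllib.request.urlopen(req).read().decode("utf-8"))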
If none of the above works, you could attempt a different connector.
I was able to solve this issue using the Python Kafka client. Following is the new architecture of my pipeline.

I used Python 2 even though the Confluent documentation says that Python 3 is supported. The main reason was that there was some Python 2-syntax code. For instance... (not exactly the following line, but similar syntax):
except NameError, err:
In order to use it with Python 3, I would need to convert the above line into:
except NameError as err:
That being said, following is my Python code. Note that this code is only for prototyping and not for production.
code
from confluent_kafka.avro import AvroConsumer

c = AvroConsumer({
    'bootstrap.servers': '',
    'group.id': 'groupid',
    'schema.registry.url': ''
})
c.subscribe(['higee.higee.higee'])

x = True
while x:
    msg = c.poll(100)
    if msg:
        message = msg.value()
        print(message)
        x = False

c.close()
(After updating a document in MongoDB) let's check the message variable:
{u'after': None,
 u'op': u'u',
 u'patch': u'{ "_id" : {"$oid" : "5adafc0e2a0f383bb63910a6"}, "name" : "higee", "salary" : 100, "origin" : "S Korea"}',
 u'source': {u'h': 5734791721791032689L,
             u'initsync': False,
             u'name': u'higee',
             u'ns': u'higee.higee',
             u'ord': 1,
             u'rs': u'',
             u'sec': 1524362971,
             u'version': u'0.7.5'},
 u'ts_ms': 1524362971148}
code
patch = message['patch']
patch_dict = eval(patch)
patch_dict.pop('_id')
check patch_dict:
{'name': 'higee', 'origin': 'S Korea', 'salary': 100}
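(As an aside, since patch is a JSON string, json.loads is a safer way to parse it than eval; a minimal sketch:)

# Safer alternative to eval() for the JSON-encoded patch string.
import json

patch_dict = json.loads(message['patch'])
patch_dict.pop('_id', None)   # drop the ObjectId wrapper before re-producing
print(patch_dict)             # {'name': 'higee', 'origin': 'S Korea', 'salary': 100}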
from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer
value_schema_str = """
{
"namespace": "higee.higee",
"name": "MongoEvent",
"type": "record",
"fields" : [
{
"name" : "name",
"type" : "string"
},
{
"name" : "origin",
"type" : "string"
},
{
"name" : "salary",
"type" : "int32"
}
]
}
"""
AvroProducerConf = {
'bootstrap.servers': '',
'schema.registry.url': ''
}
value_schema = avro.loads(value_schema_str)
avroProducer = AvroProducer(
AvroProducerConf,
default_value_schema=value_schema
)
avroProducer.produce(topic='python', value=patch_dict)
avroProducer.flush()
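(For reference, the consume/parse/produce steps above can be combined into a single loop; a rough prototype sketch, with the broker and schema registry addresses left as placeholders and value_schema_str as defined above:)

# Prototype loop: consume the Debezium envelope, flatten "patch", and re-produce
# the result to the "python" topic. Addresses are placeholders.
import json
from confluent_kafka import avro
from confluent_kafka.avro import AvroConsumer, AvroProducer

consumer = AvroConsumer({
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'flattener',
    'schema.registry.url': 'http://localhost:8081'
})
consumer.subscribe(['higee.higee.higee'])

producer = AvroProducer(
    {'bootstrap.servers': 'localhost:9092',
     'schema.registry.url': 'http://localhost:8081'},
    default_value_schema=avro.loads(value_schema_str)
)

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        patch = msg.value().get('patch')
        if not patch:
            continue
        record = json.loads(patch)
        record.pop('_id', None)
        producer.produce(topic='python', value=record)
        producer.flush()
finally:
    consumer.close()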
The only thing left is to make the Elasticsearch sink connector respond to the new topic 'python' by setting the configuration in the following format. Everything remains the same except topics.
name=elasticsearch-sink
connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
tasks.max=1
topics=python
key.ignore=true
connection.url=''
type.name=kafka-connect
Then run the Elasticsearch sink connector and check the result in Elasticsearch.
{
"_index": "zzzz",
"_type": "kafka-connect",
"_id": "zzzz+0+3",
"_score": 1,
"_source": {
"name": "higee",
"origin": "S Korea",
"salary": 100
}
}
+1 to @cricket_007's suggestion - use the io.debezium.connector.mongodb.transforms.UnwrapFromMongoDbEnvelope single message transformation. You can read more about SMTs and their benefits here.