
Kafka Connect : JDBC Source Connector : create Topic with multiple partitions

I have created a sample pipeline that polls data from MySQL and writes it to HDFS (and a Hive table).

Due to my requirements, I need to create a Source + Sink connector pair for each database table. Below I have posted the configuration settings for my Source and Sink connectors.

I can see that a topic is created with one partition and a replication factor of 1.

Topic creation should be automatic, meaning I can't create the topics manually before creating the Source + Sink pair.

My questions:

1) Is there a way to configure the number of partitions and replication factor when creating the Source Connector?

2) If it's possible to create multiple partitions, what kind of partitioning strategy does the Source Connector use?

3) What is the correct number of workers that should be created for the Source and Sink Connectors?

Source Connector:

{
  "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
  "mode": "timestamp+incrementing",
  "timestamp.column.name": "modified",
  "incrementing.column.name": "id",
  "topic.prefix": "jdbc_var_cols-",
  "tasks.max": "1",
  "poll.interval.ms": "1000",
  "query": "SELECT id,name,email,department,modified FROM test",
  "connection.url": "jdbc:mariadb://127.0.0.1:3306/connect_test?user=root&password=confluent"
}

Sink Connector:

{
  "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
  "topics.dir": "/user/datalake/topics-hive-var_cols3",
  "hadoop.conf.dir": "/tmp/quickstart/hadoop/conf",
  "flush.size": "5",
  "schema.compatibility": "BACKWARD",
  "connect.hdfs.principal": "datalake@MYREALM.LOCAL",
  "connect.hdfs.keytab": "/tmp/quickstart/datalake.keytab",
  "tasks.max": "3",
  "topics": "jdbc_var_cols-",
  "hdfs.url": "hdfs://mycluster:8020",
  "hive.database": "kafka_connect_db_var_cols3",
  "hdfs.authentication.kerberos": "true",
  "rotate.interval.ms": "1000",
  "hive.metastore.uris": "thrift://hive_server:9083",
  "hadoop.home": "/tmp/quickstart/hadoop",
  "logs.dir": "/logs",
  "format.class": "io.confluent.connect.hdfs.avro.AvroFormat",
  "hive.integration": "true",
  "hdfs.namenode.principal": "nn/_HOST@MYREALM.LOCAL",
  "hive.conf.dir": "/tmp/quickstart/hadoop/conf"
}

1) Is there a way to configure the number of partitions and replication factor when creating the Source Connector?

Not from Connect, no.

It sounds like you have auto topic creation enabled on the broker, so the topic is created with the broker defaults. This should ideally be disabled in a production environment, and therefore you must create the topics ahead of time.
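
Since Connect itself cannot set these, here is a minimal sketch of pre-creating the topic with the desired partition count and replication factor using Kafka's Java AdminClient. The topic name matches the source's topic.prefix (which the JDBC connector uses as the full topic name when a query is configured); the bootstrap address, 3 partitions, and replication factor 2 are placeholder assumptions:

import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateConnectTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder broker address - substitute your own bootstrap servers.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // "jdbc_var_cols-" is the topic.prefix from the source config above;
            // 3 partitions / replication factor 2 are illustrative values only.
            NewTopic topic = new NewTopic("jdbc_var_cols-", 3, (short) 2);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}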

2) What kind of partitioning strategy does the Source Connector use?

It depends on which connector it is and how its code is written (i.e. whether and how it generates a record's key). For example, with the JDBC connector, the key might be the primary key of your database table; it would then be hashed using the DefaultPartitioner. I do not believe Connect allows you to specify a custom partitioner at a per-connector level. If the keys were null, then messages would be distributed over all partitions.
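
As a rough illustration of that behaviour (not Connect-specific code), here is a minimal sketch of the hash-modulo-partition-count rule the DefaultPartitioner applies to non-null keys, using Kafka's own murmur2 helpers; the key value and partition count are made-up examples:

import java.nio.charset.StandardCharsets;

import org.apache.kafka.common.utils.Utils;

public class KeyPartitionSketch {
    // What the DefaultPartitioner does for a non-null key:
    // murmur2-hash the serialized key bytes, then take the result modulo the partition count.
    static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    public static void main(String[] args) {
        // e.g. a record keyed by the primary-key value "42" on a hypothetical 3-partition topic
        System.out.println(partitionFor("42", 3));
    }
}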

3) What is the correct number of workers that should be created for the Source and Sink Connectors?

Again, depends on the source. For JDBC, you would have one task per table.

For sinks, though, the number of tasks can only be up to the number of partitions of the topics being consumed (as with all consumer groups).
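
Since the sink's effective parallelism is capped by the partition count, a hedged sketch of checking how many partitions the topic actually has before sizing tasks.max (topic name taken from the configs above; the broker address is a placeholder):

import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class CheckSinkParallelism {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder broker address - substitute your own bootstrap servers.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription description = admin
                    .describeTopics(Collections.singleton("jdbc_var_cols-"))
                    .all().get()
                    .get("jdbc_var_cols-");
            // Setting the HDFS sink's tasks.max above this number gains nothing:
            // the extra tasks would simply sit idle.
            System.out.println("Partitions: " + description.partitions().size());
        }
    }
}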


Also, you would typically run the Connect cluster separately from your database (and the Hadoop cluster).
