
Stream join example with Apache Kafka?

I was looking for an example using Kafka Streams showing how to do this sort of thing, i.e. join a customers table with an addresses table and sink the data to ES:

Customers

+------+------------+----------------+-----------------------+
| id   | first_name | last_name      | email                 |
+------+------------+----------------+-----------------------+
| 1001 | Sally      | Thomas         | sally.thomas@acme.com |
| 1002 | George     | Bailey         | gbailey@foobar.com    |
| 1003 | Edward     | Davidson       | ed@walker.com         |
| 1004 | Anne       | Kim            | annek@noanswer.org    |
+------+------------+----------------+-----------------------+

Addresses

+----+-------------+---------------------------+------------+--------------+-------+----------+
| id | customer_id | street                    | city       | state        | zip   | type     |
+----+-------------+---------------------------+------------+--------------+-------+----------+
| 10 |        1001 | 3183 Moore Avenue         | Euless     | Texas        | 76036 | SHIPPING |
| 11 |        1001 | 2389 Hidden Valley Road   | Harrisburg | Pennsylvania | 17116 | BILLING  |
| 12 |        1002 | 281 Riverside Drive       | Augusta    | Georgia      | 30901 | BILLING  |
| 13 |        1003 | 3787 Brownton Road        | Columbus   | Mississippi  | 39701 | SHIPPING |
| 14 |        1003 | 2458 Lost Creek Road      | Bethlehem  | Pennsylvania | 18018 | SHIPPING |
| 15 |        1003 | 4800 Simpson Square       | Hillsdale  | Oklahoma     | 73743 | BILLING  |
| 16 |        1004 | 1289 University Hill Road | Canehill   | Arkansas     | 72717 | LIVING   |
+----+-------------+---------------------------+------------+--------------+-------+----------+

Output Elasticsearch index

"hits": [
  {
    "_index": "customers_with_addresses",
    "_type": "_doc",
    "_id": "1",
    "_score": 1.3278645,
    "_source": {
      "first_name": "Sally",
      "last_name": "Thomas",
      "email": "sally.thomas@acme.com",
      "addresses": [{
        "street": "3183 Moore Avenue",
        "city": "Euless",
        "state": "Texas",
        "zip": "76036",
        "type": "SHIPPING"
      }, {
        "street": "2389 Hidden Valley Road",
        "city": "Harrisburg",
        "state": "Pennsylvania",
        "zip": "17116",
        "type": "BILLING"
      }]
    }
  }, ….

The table data is coming from Debezium topics. Am I correct in thinking I need some Java in the middle to join the streams and output the result to a new topic, which is then sunk into ES?

Would anyone have any example code of this?

Thanks.

Depending on how strict your requirement is to nest multiple addresses in one customer node, you can do this in KSQL (which is built on top of Kafka Streams).

Populate some test data into Kafka (which in your case is done already through Debezium):

$ curl -s "https://api.mockaroo.com/api/ffa9ff20?count=10&key=ff7856d0" | kafkacat -b localhost:9092 -t addresses -P

$ curl -s "https://api.mockaroo.com/api/9b868890?count=4&key=ff7856d0" | kafkacat -b localhost:9092 -t customers -P

Fire up KSQL and, to start with, just inspect the data:

ksql> PRINT 'addresses' FROM BEGINNING ;
Format:JSON
{"ROWTIME":1558519823351,"ROWKEY":"null","id":1,"customer_id":1004,"street":"8 Moulton Center","city":"Bronx","state":"New York","zip":"10474","type":"BILLING"}
{"ROWTIME":1558519823351,"ROWKEY":"null","id":2,"customer_id":1001,"street":"5 Hollow Ridge Alley","city":"Washington","state":"District of Columbia","zip":"20016","type":"LIVING"}
{"ROWTIME":1558519823351,"ROWKEY":"null","id":3,"customer_id":1000,"street":"58 Maryland Point","city":"Greensboro","state":"North Carolina","zip":"27404","type":"LIVING"}
{"ROWTIME":1558519823351,"ROWKEY":"null","id":4,"customer_id":1002,"street":"55795 Derek Avenue","city":"Temple","state":"Texas","zip":"76505","type":"LIVING"}
{"ROWTIME":1558519823351,"ROWKEY":"null","id":5,"customer_id":1002,"street":"164 Continental Plaza","city":"Modesto","state":"California","zip":"95354","type":"SHIPPING"}
{"ROWTIME":1558519823351,"ROWKEY":"null","id":6,"customer_id":1004,"street":"6 Miller Road","city":"Louisville","state":"Kentucky","zip":"40205","type":"BILLING"}
{"ROWTIME":1558519823351,"ROWKEY":"null","id":7,"customer_id":1003,"street":"97 Shasta Place","city":"Pittsburgh","state":"Pennsylvania","zip":"15286","type":"BILLING"}
{"ROWTIME":1558519823351,"ROWKEY":"null","id":8,"customer_id":1000,"street":"36 Warbler Circle","city":"Memphis","state":"Tennessee","zip":"38109","type":"SHIPPING"}
{"ROWTIME":1558519823351,"ROWKEY":"null","id":9,"customer_id":1001,"street":"890 Eagan Circle","city":"Saint Paul","state":"Minnesota","zip":"55103","type":"SHIPPING"}
{"ROWTIME":1558519823354,"ROWKEY":"null","id":10,"customer_id":1000,"street":"8 Judy Terrace","city":"Washington","state":"District of Columbia","zip":"20456","type":"SHIPPING"}
^C
Topic printing ceased

ksql>
ksql> PRINT 'customers' FROM BEGINNING;
Format:JSON
{"ROWTIME":1558519852363,"ROWKEY":"null","id":1001,"first_name":"Jolee","last_name":"Handasyde","email":"jhandasyde0@nhs.uk"}
{"ROWTIME":1558519852363,"ROWKEY":"null","id":1002,"first_name":"Rebeca","last_name":"Kerrod","email":"rkerrod1@sourceforge.net"}
{"ROWTIME":1558519852363,"ROWKEY":"null","id":1003,"first_name":"Bobette","last_name":"Brumble","email":"bbrumble2@cdc.gov"}
{"ROWTIME":1558519852368,"ROWKEY":"null","id":1004,"first_name":"Royal","last_name":"De Biaggi","email":"rdebiaggi3@opera.com"}

Now we declare a STREAM (Kafka topic + schema) on the data so that we can manipulate it further:

ksql> CREATE STREAM addresses_RAW (ID INT, CUSTOMER_ID INT, STREET VARCHAR, CITY VARCHAR, STATE VARCHAR, ZIP VARCHAR, TYPE VARCHAR) WITH (KAFKA_TOPIC='addresses', VALUE_FORMAT='JSON');

 Message
----------------
 Stream created
----------------

ksql> CREATE STREAM customers_RAW (ID INT, FIRST_NAME VARCHAR, LAST_NAME VARCHAR, EMAIL VARCHAR) WITH (KAFKA_TOPIC='customers', VALUE_FORMAT='JSON');

 Message
----------------
 Stream created
----------------

We're going to model the customers as a TABLE, and to do that the Kafka messages need to be keyed correctly (at the moment they have null keys, as can be seen from the "ROWKEY":"null" in the PRINT output above). You can configure Debezium to set the message key, so this step may not be necessary for you in KSQL:

ksql> CREATE STREAM CUSTOMERS_KEYED WITH (PARTITIONS=1) AS SELECT * FROM CUSTOMERS_RAW PARTITION BY ID;

 Message
----------------------------
 Stream created and running
----------------------------

Now we declare a TABLE (state for a given key, instantiated from a Kafka topic + schema):

ksql> CREATE TABLE CUSTOMER (ID INT, FIRST_NAME VARCHAR, LAST_NAME VARCHAR, EMAIL VARCHAR) WITH (KAFKA_TOPIC='CUSTOMERS_KEYED', VALUE_FORMAT='JSON', KEY='ID');

 Message
---------------
 Table created
---------------

Now we can join the data:


ksql> CREATE STREAM customers_with_addresses AS 
      SELECT CUSTOMER_ID, 
             FIRST_NAME + ' ' + LAST_NAME AS FULL_NAME, 
             FIRST_NAME, 
             LAST_NAME, 
             TYPE AS ADDRESS_TYPE, 
             STREET, 
             CITY, 
             STATE, 
             ZIP 
        FROM ADDRESSES_RAW A 
             INNER JOIN CUSTOMER C 
             ON A.CUSTOMER_ID = C.ID;

 Message
----------------------------
 Stream created and running
----------------------------

This creates a new KSQL STREAM which in turn populates a new Kafka topic.

ksql> SHOW STREAMS;

 Stream Name                              | Kafka Topic                          | Format
------------------------------------------------------------------------------------------
 CUSTOMERS_KEYED                          | CUSTOMERS_KEYED                      | JSON
 ADDRESSES_RAW                            | addresses                            | JSON
 CUSTOMERS_RAW                            | customers                            | JSON
 CUSTOMERS_WITH_ADDRESSES                 | CUSTOMERS_WITH_ADDRESSES             | JSON

The stream has a schema:

ksql> DESCRIBE CUSTOMERS_WITH_ADDRESSES;

Name                 : CUSTOMERS_WITH_ADDRESSES
 Field        | Type
------------------------------------------
 ROWTIME      | BIGINT           (system)
 ROWKEY       | VARCHAR(STRING)  (system)
 CUSTOMER_ID  | INTEGER          (key)
 FULL_NAME    | VARCHAR(STRING)
 FIRST_NAME   | VARCHAR(STRING)
 ADDRESS_TYPE | VARCHAR(STRING)
 LAST_NAME    | VARCHAR(STRING)
 STREET       | VARCHAR(STRING)
 CITY         | VARCHAR(STRING)
 STATE        | VARCHAR(STRING)
 ZIP          | VARCHAR(STRING)
------------------------------------------
For runtime statistics and query details run: DESCRIBE EXTENDED <Stream,Table>;

We can query the stream:

ksql> SELECT * FROM CUSTOMERS_WITH_ADDRESSES WHERE CUSTOMER_ID=1002;
1558519823351 | 1002 | 1002 | Rebeca Kerrod | Rebeca | LIVING | Kerrod | 55795 Derek Avenue | Temple | Texas | 76505
1558519823351 | 1002 | 1002 | Rebeca Kerrod | Rebeca | SHIPPING | Kerrod | 164 Continental Plaza | Modesto | California | 95354

We can also stream it to Elasticsearch using Kafka Connect:

curl -i -X POST -H "Accept:application/json" \
    -H  "Content-Type:application/json" http://localhost:8083/connectors/ \
    -d '{
      "name": "sink-elastic-customers_with_addresses-00",
      "config": {
        "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
        "topics": "CUSTOMERS_WITH_ADDRESSES",
        "connection.url": "http://elasticsearch:9200",
        "type.name": "type.name=kafkaconnect",
        "key.ignore": "true",
        "schema.ignore": "true",
        "key.converter": "org.apache.kafka.connect.storage.StringConverter",
        "value.converter": "org.apache.kafka.connect.json.JsonConverter",
        "value.converter.schemas.enable": "false"
      }
    }'

Result:

$ curl -s http://localhost:9200/customers_with_addresses/_search | jq '.hits.hits[0]'
{
  "_index": "customers_with_addresses",
  "_type": "type.name=kafkaconnect",
  "_id": "CUSTOMERS_WITH_ADDRESSES+0+2",
  "_score": 1,
  "_source": {
    "ZIP": "76505",
    "CITY": "Temple",
    "ADDRESS_TYPE": "LIVING",
    "CUSTOMER_ID": 1002,
    "FULL_NAME": "Rebeca Kerrod",
    "STATE": "Texas",
    "STREET": "55795 Derek Avenue",
    "LAST_NAME": "Kerrod",
    "FIRST_NAME": "Rebeca"
  }
}

Yes, you can implement the solution using the Kafka Streams API in Java in the following way:

  1. Consume the topics as streams.
  2. Aggregate the address stream into a list keyed by customer ID and convert the stream into a table.
  3. Join the customer stream with the address table.

Below is an example (assuming the data is consumed in JSON format):

import java.util.ArrayList;
import java.util.List;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ArrayNode;
import com.fasterxml.jackson.databind.node.ObjectNode;

import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Joined;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;

KStream<String, JsonNode> customers = builder.stream("customer", Consumed.with(stringSerde, jsonNodeSerde));
KStream<String, JsonNode> addresses = builder.stream("address", Consumed.with(stringSerde, jsonNodeSerde));

ObjectMapper mapper = new ObjectMapper();

// Re-key the customer stream by the customer id so it can be joined with the addresses.
KStream<String, JsonNode> customerRekeyed = customers.selectKey((key, value) -> value.get("id").asText());

// Re-key the addresses by customer_id and aggregate them into one JSON array per customer.
KTable<String, JsonNode> addressTable = addresses
        .selectKey((key, value) -> value.get("customer_id").asText())
        .groupByKey(Grouped.with(stringSerde, jsonNodeSerde))
        .aggregate(
                () -> (JsonNode) mapper.createArrayNode(),                                       // initializer: empty array
                (customerId, address, addressArray) -> ((ArrayNode) addressArray).add(address),  // adder
                Materialized.with(stringSerde, jsonNodeSerde));

// Join the customer stream with the address table and nest the addresses under "addresses".
KStream<String, JsonNode> customerAddressStream = customerRekeyed.leftJoin(addressTable,
        (customer, addressArray) -> {
            ObjectNode result = ((ObjectNode) customer).deepCopy();
            List<JsonNode> addressList = new ArrayList<>();
            if (addressArray != null) {   // left join: a customer may not have any addresses yet
                ((ArrayNode) addressArray).elements().forEachRemaining(addressList::add);
            }
            result.putArray("addresses").addAll(addressList);
            return result;
        },
        Joined.with(stringSerde, jsonNodeSerde, jsonNodeSerde));
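For completeness, and in case it helps, here is a rough sketch of the wiring the snippet above leaves implicit: the builder, stringSerde and jsonNodeSerde it references, plus writing the joined records to an output topic and starting the application. The application id, broker address and output topic name are placeholders I have assumed, and the JsonNode serde is built from Kafka's Connect JSON serializer/deserializer (the connect-json artifact); any Serde<JsonNode> would work. The declarations at the top would sit before the join code, and the last few lines after it.

import java.util.Properties;

import com.fasterxml.jackson.databind.JsonNode;

import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.connect.json.JsonDeserializer;
import org.apache.kafka.connect.json.JsonSerializer;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Produced;

// Serdes assumed by the snippet above: String keys and Jackson JsonNode values.
Serde<String> stringSerde = Serdes.String();
Serde<JsonNode> jsonNodeSerde = Serdes.serdeFrom(new JsonSerializer(), new JsonDeserializer());

// Basic Streams configuration; the application id and broker address are placeholders.
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "customer-address-join");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

StreamsBuilder builder = new StreamsBuilder();

// ... build the join topology from the snippet above here, yielding customerAddressStream ...

// Write the nested customer documents to a new topic that a Kafka Connect
// Elasticsearch sink (as in the KSQL answer) can pick up and index.
customerAddressStream.to("customers_with_addresses", Produced.with(stringSerde, jsonNodeSerde));

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();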

You can find the details about all types of joins here:

https://docs.confluent.io/current/streams/developer-guide/dsl-api.html#joining

We built a demo and blog post on this very use case (streaming aggregates to Elasticsearch) a while ago on the Debezium blog.

One issue to keep in mind is that this solution (based on Kafka Streams, but I reckon it's the same for KSQL) is prone to exposing intermediary join results. E.g. assume you insert a customer and 10 addresses in one transaction. The stream join approach might first produce an aggregate of the customer and their first five addresses, and shortly thereafter the complete aggregate with all 10 addresses. This might or might not be desirable for your specific use case. I also remember that handling deletions isn't trivial (e.g. if you delete one of the 10 addresses, you'll have to produce the aggregate again with the remaining 9 addresses, which might otherwise have been untouched).

An alternative to consider can be the outbox pattern, where you'd essentially produce an explicit event with the precomputed aggregate from within your application itself. I.e. it requires a little help from the application, but then it avoids the subtleties of producing that join result after the fact.
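In case it's useful, below is a minimal, hypothetical sketch of that idea using plain JDBC: in the same database transaction that changes the customer and address rows, the application also inserts one row containing the already-nested customer+addresses JSON into an outbox table, which Debezium then captures and streams as a single complete event. The outbox table and its columns (aggregate_type, aggregate_id, payload) are assumptions for illustration, not something defined in this thread.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import javax.sql.DataSource;

public class OutboxWriter {

    private final DataSource dataSource;

    public OutboxWriter(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    // Hypothetical sketch: write the precomputed customer+addresses aggregate into an
    // "outbox" table in the same transaction as the business change, so Debezium emits
    // one complete event per change instead of intermediary join results.
    public void saveCustomerAggregate(String customerId, String aggregateJson) throws SQLException {
        try (Connection conn = dataSource.getConnection()) {
            conn.setAutoCommit(false);
            try {
                // 1. Apply the business change (insert/update customer and address rows) ...

                // 2. Insert the already-nested JSON document as a single outbox event.
                try (PreparedStatement ps = conn.prepareStatement(
                        "INSERT INTO outbox (aggregate_type, aggregate_id, payload) VALUES (?, ?, ?)")) {
                    ps.setString(1, "customer_with_addresses");
                    ps.setString(2, customerId);
                    ps.setString(3, aggregateJson);
                    ps.executeUpdate();
                }
                conn.commit();
            } catch (SQLException e) {
                conn.rollback();
                throw e;
            }
        }
    }
}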


 