
Key while creating KSQL Stream

1) Is a key required on a stream on which you want to run an aggregate function? I have read several blogs, and a recommendation from Confluent, saying that a KEY is required for aggregate functions to work.

CREATE STREAM Employee
  (EmpId BIGINT,
   EmpName VARCHAR,
   DeptId BIGINT,
   SAL BIGINT)
  WITH (KAFKA_TOPIC='EmpTopic',
        VALUE_FORMAT='JSON');

When defining the stream above, I did not define any KEY (ROWKEY is NULL). The underlying topic 'EmpTopic' does not have a key either.

I am running an aggregate function on this stream:

CREATE TABLE SALBYDEPT AS
    SELECT DeptId, 
             SUM(SAL) 
      FROM Employee 
      GROUP BY DeptId;

Please confirm whether running an aggregate function on the stream above requires a KEY on the 'Employee' stream, i.e. a NOT NULL ROWKEY.

2) As per the Confluent documentation, "Windowing lets you control how to group records that have the same key for stateful operations, like aggregations or joins, into time spans. KSQL tracks windows per record key". Please help me understand what this statement means. Does the stream need to have a NOT NULL key?

3) Will a stream-table JOIN retain the KEY?

CREATE TABLE users 
  (registertime BIGINT, 
   userid VARCHAR, 
   gender VARCHAR, 
   regionid VARCHAR) 
  WITH (KAFKA_TOPIC = 'users', 
        VALUE_FORMAT='JSON', 
        KEY = 'userid');

CREATE STREAM pageviews 
  (viewtime BIGINT, 
   userid VARCHAR, 
   pageid VARCHAR) 
  WITH (KAFKA_TOPIC='pageviews', 
        VALUE_FORMAT='DELIMITED', 
        KEY='pageid', 
        TIMESTAMP='viewtime');

CREATE STREAM pageviews_transformed as 
  SELECT viewtime, 
         userid, 
         pageid, 
         TIMESTAMPTOSTRING(viewtime, 'yyyy-MM-dd HH:mm:ss.SSS') AS timestring 
  FROM pageviews;

CREATE STREAM pageviews_enriched AS 
  SELECT pv.viewtime, 
         pv.userid AS userid, 
         pv.pageid, 
         pv.timestring, 
         u.gender, 
         u.regionid, 
         u.interests, 
         u.contactinfo 
  FROM pageviews_transformed pv 
  LEFT JOIN users u ON pv.userid = u.userid;

Will the stream-table JOIN retain 'userid' as the ROWKEY in the new stream 'pageviews_enriched'?

4) I have seen several examples from Confluent on GitHub where the stream used in a JOIN is not keyed. But as per the documentation, a stream participating in a JOIN should have a NOT NULL ROWKEY. Please confirm whether the stream needs a NOT NULL ROWKEY.

This applies to both stream-stream and stream-table joins. In the example below I am performing a JOIN between a stream with a NULL ROWKEY and a table. Is this valid?

CREATE TABLE users 
  (registertime BIGINT, 
   userid VARCHAR, 
   gender VARCHAR, 
   regionid VARCHAR) 
  WITH (KAFKA_TOPIC = 'users', 
        VALUE_FORMAT='JSON', 
        KEY = 'userid');

CREATE STREAM pageviews 
  (viewtime BIGINT, 
   userid VARCHAR, 
   pageid VARCHAR) 
  WITH (KAFKA_TOPIC='pageviews', 
        VALUE_FORMAT='DELIMITED', 
        TIMESTAMP='viewtime');

CREATE STREAM pageviews_transformed as 
  SELECT viewtime, 
         userid, 
         pageid, 
         TIMESTAMPTOSTRING(viewtime, 'yyyy-MM-dd HH:mm:ss.SSS') AS timestring 
  FROM pageviews;

CREATE STREAM pageviews_enriched AS 
  SELECT pv.viewtime, 
         pv.userid AS userid, 
         pv.pageid, 
         pv.timestring, 
         u.gender, 
         u.regionid, 
         u.interests, 
         u.contactinfo 
  FROM pageviews_transformed pv 
  LEFT JOIN users u ON pv.userid = u.userid;

CREATE TABLE SALBYDEPT AS SELECT DeptId, SUM(SAL) FROM Employee GROUP BY DeptId;

  1. Please confirm whether running an aggregate function on the stream above requires a KEY on the 'Employee' stream, i.e. a NOT NULL ROWKEY.

You do not need a key on this stream. The key of the created table will be DeptId, because KSQL repartitions the data on the GROUP BY column.
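As a quick check, here is a sketch assuming KSQL 5.x, where every row exposes a ROWKEY pseudo-column. The statements below let you look at what the resulting table is keyed on. The table name SALBYDEPT_CHECK and the alias TOTAL_SAL are made up for illustration; they are not from the question.

```sql
-- Hypothetical table name and alias; same aggregation as in the question.
CREATE TABLE SALBYDEPT_CHECK AS
  SELECT DeptId,
         SUM(SAL) AS TOTAL_SAL
  FROM Employee
  GROUP BY DeptId;

-- Shows the table's schema and key information.
DESCRIBE EXTENDED SALBYDEPT_CHECK;

-- Lets you compare ROWKEY against DeptId row by row.
SELECT ROWKEY, DeptId, TOTAL_SAL FROM SALBYDEPT_CHECK;
```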


  2. As per the Confluent documentation, "Windowing lets you control how to group records that have the same key for stateful operations, like aggregations or joins, into time spans. KSQL tracks windows per record key". Please help me understand what this statement means. Does the stream need to have a NOT NULL key?

This means that an aggregation can be computed over a time window, and that the time window becomes part of the message key. For example, instead of aggregating SAL (sales?) across all employee records, you could aggregate per time window, perhaps every hour or every day. In that case the key is the aggregate key (DeptId) combined with the window key (e.g. for hourly windows: 2019-06-23 06:00:00, 2019-06-23 07:00:00, 2019-06-23 08:00:00, and so on).
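A minimal sketch of such an hourly aggregation, reusing the Employee stream from the question. The table name SALBYDEPT_HOURLY and the alias TOTAL_SAL are illustrative assumptions; the one-hour tumbling window is just one possible choice.

```sql
-- Hypothetical table name; sums SAL per DeptId within each one-hour window.
CREATE TABLE SALBYDEPT_HOURLY AS
  SELECT DeptId,
         SUM(SAL) AS TOTAL_SAL
  FROM Employee
  WINDOW TUMBLING (SIZE 1 HOUR)
  GROUP BY DeptId;
```

Each DeptId then yields one result row per hourly window, which is what "KSQL tracks windows per record key" refers to.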


  3. Will a stream-table JOIN retain the KEY?

It will retain the stream's key, unless you include a PARTITION BY clause in the CREATE STREAM ... AS SELECT statement.
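As a sketch of the PARTITION BY variant, the join from the question could be rekeyed on pageid like this. The stream name pageviews_enriched_by_page is hypothetical, the column list is trimmed to columns that the users table in the question actually declares, and the exact PARTITION BY rules differ between KSQL versions, so treat this as untested against any particular release.

```sql
-- Hypothetical stream name; explicit aliases keep the output column names simple.
CREATE STREAM pageviews_enriched_by_page AS
  SELECT pv.viewtime AS viewtime,
         pv.userid AS userid,
         pv.pageid AS pageid,
         u.gender AS gender,
         u.regionid AS regionid
  FROM pageviews_transformed pv
  LEFT JOIN users u ON pv.userid = u.userid
  PARTITION BY pageid;
```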


  4. I have seen several examples from Confluent on GitHub where the stream used in a JOIN is not keyed. But as per the documentation, a stream participating in a JOIN should have a NOT NULL ROWKEY. Please confirm whether the stream needs a NOT NULL ROWKEY.

Do you have a link to the specific documentation you're referencing? While a table does need to be keyed, a stream does not (KSQL may handle this under the covers; I'm not sure).
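If you would rather not rely on KSQL handling this internally, one option is to rekey the stream on the join column yourself before joining. A minimal sketch, assuming the pageviews_transformed stream from the question; the name pageviews_by_user is made up for illustration.

```sql
-- Hypothetical stream name; writes a copy of the stream keyed on userid.
CREATE STREAM pageviews_by_user AS
  SELECT *
  FROM pageviews_transformed
  PARTITION BY userid;
```

The join can then read from pageviews_by_user instead of pageviews_transformed.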
