KSQL - INSERT INTO a Stream yields no data

I'm having problems reading messages from a stream that is being populated using KSQL's INSERT INTO operation.

The steps I've followed are:

I have a stream, event_stream, which I created from a Kafka topic.

CREATE STREAM event_stream (eventType varchar, eventTime varchar, 
sourceHostName varchar) WITH (kafka_topic='events', value_format='json');

SELECT * FROM event_stream; shows messages coming in correctly.

I want to send some of these messages to another Kafka topic, output_events, which I have already created.

I then create a second stream in KSQL:

CREATE STREAM output_stream (eventTime varchar, extraColumn varchar, 
sourceHostName varchar) WITH (kafka_topic='output_events', value_format='json');

Finally, I link the input to the output with the following:

INSERT INTO output_stream SELECT eventTime, 'Extra info' as extraColumn,    
sourceHostName FROM event_stream WHERE eventType = 'MatchingValue';

All of the above steps complete without errors, but if I run SELECT * FROM output_stream; I get no data. Why is this?

Running the SELECT part of the above query works fine, so I can see matching results are arriving on the topic.

Strangely, if I run DESCRIBE EXTENDED output_stream, the message count does indicate that messages are reaching the stream:

Local runtime statistics                                                                        
------------------------                                                                        
messages-per-sec:      0.33   total-messages:        86     last-message: 11/9/18 1:15:43 PM UTC
failed-messages:         0 failed-messages-per-sec:         0      last-failed:       n/a
(Statistics of the local KSQL server interaction with the Kafka topic output_events) 

I've also checked the ksql-server logs, but can't see any errors there.

This is a bug, triggered by unintentionally using the wrong variant of CREATE STREAM. You're using the variant that 'registers' a KSQL stream against an existing topic. For INSERT INTO to work, the target needs to be a stream created with CREATE STREAM target AS SELECT ("CSAS").
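To make the distinction concrete, here is an illustrative sketch of the two variants (the stream and topic names are hypothetical):

-- Variant 1: registers a stream over an existing topic. No query runs;
-- KSQL only reads whatever the topic's own producers write to it.
CREATE STREAM my_stream (col1 VARCHAR) WITH (kafka_topic='my_topic', value_format='json');

-- Variant 2 (CSAS): creates the stream AND starts a continuous query
-- that actively populates it from another stream.
CREATE STREAM my_stream AS SELECT col1 FROM my_other_stream;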

Let's work through it. Here I'm using a docker-compose file for a test setup.
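As a rough sketch of what such a setup might look like (this is an assumption, not the exact compose file used in the original answer; image versions, ports, and the cos_default network name would need to match your environment):

# Illustrative minimal Kafka + KSQL setup; assumed, not the original file
version: '2'
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:5.0.1
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
  kafka:
    image: confluentinc/cp-kafka:5.0.1
    depends_on:
      - zookeeper
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:29092
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:29092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
  ksql-server:
    image: confluentinc/cp-ksql-server:5.0.1
    depends_on:
      - kafka
    environment:
      KSQL_BOOTSTRAP_SERVERS: kafka:29092
      KSQL_LISTENERS: http://0.0.0.0:8088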

Populate some dummy data:

docker run --rm --interactive --network cos_default confluentinc/cp-kafkacat kafkacat -b kafka:29092 -t events -P <<EOF
{"eventType":"1", "eventTime" :"2018-11-13-06:34:57", "sourceHostName":"asgard"}
{"eventType":"2", "eventTime" :"2018-11-13-06:35:57", "sourceHostName":"asgard"}
{"eventType":"MatchingValue", "eventTime" :"2018-11-13-06:35:58", "sourceHostName":"asgard"}
EOF
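As an optional sanity check, the same kafkacat image can read the messages back in consumer mode (-C; -e exits once it reaches the end of the topic):

docker run --rm --network cos_default confluentinc/cp-kafkacat kafkacat -b kafka:29092 -t events -C -e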

Register the source topic with KSQL:

CREATE STREAM event_stream (eventType varchar, eventTime varchar, sourceHostName varchar) WITH (kafka_topic='events', value_format='json');

Query the stream:

ksql> SET 'auto.offset.reset' = 'earliest';
Successfully changed local property 'auto.offset.reset' from 'null' to 'earliest'
ksql> SELECT * FROM event_stream;
1542091084660 | null | 1 | 2018-11-13-06:34:57 | asgard
1542091084660 | null | 2 | 2018-11-13-06:35:57 | asgard
1542091785207 | null | MatchingValue | 2018-11-13-06:35:58 | asgard

So looking at the CREATE STREAM that you quote:

CREATE STREAM output_stream (eventTime varchar, extraColumn varchar, sourceHostName varchar) WITH (kafka_topic='output_events', value_format='json');

My guess is that if you run LIST TOPICS; you'll see that this topic already exists on your Kafka broker:

ksql> LIST TOPICS;

Kafka Topic            | Registered | Partitions | Partition Replicas | Consumers | ConsumerGroups
----------------------------------------------------------------------------------------------------
_confluent-metrics     | false      | 12         | 1                  | 0         | 0
_schemas               | false      | 1          | 1                  | 0         | 0
docker-connect-configs | false      | 1          | 1                  | 0         | 0
docker-connect-offsets | false      | 25         | 1                  | 0         | 0
docker-connect-status  | false      | 5          | 1                  | 0         | 0
events                 | true       | 1          | 1                  | 0         | 0
output_events          | false      | 4          | 1                  | 0         | 0
----------------------------------------------------------------------------------------------------
ksql>

Because if it didn't, this CREATE STREAM would fail:

ksql> CREATE STREAM output_stream (eventTime varchar, extraColumn varchar, sourceHostName varchar) WITH (kafka_topic='output_events', value_format='json');
Kafka topic does not exist: output_events
ksql>

So, making this assumption, I'm also creating this topic on my test cluster:

$ docker-compose exec kafka bash -c "kafka-topics --create --zookeeper zookeeper:2181 --replication-factor 1 --partitions 4 --topic output_events"
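To double-check that the topic looks as expected before registering a stream against it, you can describe it with the same tooling:

$ docker-compose exec kafka bash -c "kafka-topics --describe --zookeeper zookeeper:2181 --topic output_events"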

And then creating the stream:

ksql> CREATE STREAM output_stream (eventTime varchar, extraColumn varchar, sourceHostName varchar) WITH (kafka_topic='output_events', value_format='json');

Message
----------------
Stream created
----------------

Note that it says Stream created, rather than Stream created and running.

Now let's run the INSERT INTO:

ksql> INSERT INTO output_stream SELECT eventTime, 'Extra info' as extraColumn, sourceHostName FROM event_stream WHERE eventType = 'MatchingValue';

Message
-------------------------------
Insert Into query is running.
-------------------------------

The DESCRIBE EXTENDED output does indeed show, as you have seen, messages being processed:

ksql> DESCRIBE EXTENDED output_stream;

Name                 : OUTPUT_STREAM
Type                 : STREAM
Key field            :
Key format           : STRING
Timestamp field      : Not set - using <ROWTIME>
Value format         : JSON
Kafka topic          : output_events (partitions: 4, replication: 1)

Field          | Type
--------------------------------------------
ROWTIME        | BIGINT           (system)
ROWKEY         | VARCHAR(STRING)  (system)
EVENTTIME      | VARCHAR(STRING)
EXTRACOLUMN    | VARCHAR(STRING)
SOURCEHOSTNAME | VARCHAR(STRING)
--------------------------------------------

Queries that write into this STREAM
-----------------------------------
InsertQuery_0 : INSERT INTO output_stream SELECT eventTime, 'Extra info' as extraColumn, sourceHostName FROM event_stream WHERE eventType = 'MatchingValue';

For query topology and execution plan please run: EXPLAIN <QueryId>

Local runtime statistics
------------------------
messages-per-sec:      0.01   total-messages:         1     last-message: 11/13/18 6:49:46 AM UTC
failed-messages:         0 failed-messages-per-sec:         0      last-failed:       n/a
(Statistics of the local KSQL server interaction with the Kafka topic output_events)

But the topic itself has no messages:

ksql> print 'output_events' from beginning;
^C

Nor does the KSQL stream:

ksql> SELECT * FROM OUTPUT_STREAM;
^CQuery terminated

So the INSERT INTO command is designed to run against an existing CSAS/CTAS target stream, rather than a source STREAM registered against an existing topic.
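For illustration, the pattern that INSERT INTO is designed for looks like this (stream names here are hypothetical): a CSAS creates the target and populates it from one source, and INSERT INTO then merges a second, schema-compatible source into the same target:

-- CSAS: creates all_events (and its backing topic) and populates it from one source
CREATE STREAM all_events AS SELECT * FROM events_dc1;

-- INSERT INTO: routes a second, schema-compatible source into the same CSAS target
INSERT INTO all_events SELECT * FROM events_dc2;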

Let's try it that way instead. First we need to drop the existing stream definition, and to do that we must first terminate the INSERT INTO query:

ksql> DROP STREAM OUTPUT_STREAM;
Cannot drop OUTPUT_STREAM.
The following queries read from this source: [].
The following queries write into this source: [InsertQuery_0].
You need to terminate them before dropping OUTPUT_STREAM.
ksql> TERMINATE InsertQuery_0;

Message
-------------------
Query terminated.
-------------------
ksql> DROP STREAM OUTPUT_STREAM;

Message
------------------------------------
Source OUTPUT_STREAM was dropped.
------------------------------------

Now create the target stream:

ksql> CREATE STREAM output_stream WITH (kafka_topic='output_events') AS SELECT eventTime, 'Extra info' as extraColumn, sourceHostName FROM event_stream WHERE eventType = 'MatchingValue';

Message
----------------------------
Stream created and running
----------------------------

Note that in creating the stream it is also running (versus before, when it was just created). Now query the stream:

ksql> SELECT * FROM OUTPUT_STREAM;
1542091785207 | null | 2018-11-13-06:35:58 | Extra info | asgard

and check the underlying topic too:

ksql> PRINT 'output_events' FROM BEGINNING;
Format:JSON
{"ROWTIME":1542091785207,"ROWKEY":"null","EVENTTIME":"2018-11-13-06:35:58","EXTRACOLUMN":"Extra info","SOURCEHOSTNAME":"asgard"}

So, you have hit a bug in KSQL (raised here), but one that fortunately can be avoided entirely by using simpler KSQL syntax: combining your CREATE STREAM and INSERT INTO queries into one.
