
KSQL Query returning unexpected values in simple aggregation

I am getting unexpected results from a KSQL query against a KTable that is itself defined by a Kafka topic. The KTable is "Trades" and it is backed by the compacted topic "localhost.dbo.TradeHistory", whose message key is TradeId. It is supposed to contain the latest state of each stock trade, keyed by TradeId. Each trade has an AccountId, and I'm trying to construct a query that returns the SUM of the trade Amounts grouped by account.

The Definition of the Trades KTABLE

ksql> create table Trades(TradeId int, AccountId int, Spn int, Amount double) with (KAFKA_TOPIC = 'localhost.dbo.TradeHistory', VALUE_FORMAT = 'JSON', KEY = 'TradeId');

...

ksql> describe extended Trades;

Name                 : TRADES
Type                 : TABLE
Key field            : TRADEID
Key format           : STRING
Timestamp field      : Not set - using <ROWTIME>
Value format         : JSON
Kafka topic          : localhost.dbo.TradeHistory (partitions: 1, replication: 1)

Field     | Type
---------------------------------------
ROWTIME   | BIGINT           (system)
ROWKEY    | VARCHAR(STRING)  (system)
TRADEID   | INTEGER
ACCOUNTID | INTEGER
SPN       | INTEGER
AMOUNT    | DOUBLE
---------------------------------------

Local runtime statistics
------------------------
consumer-messages-per-sec:         0 consumer-total-bytes:      3709 consumer-total-messages:        39     last-message: 2019-10-12T20:52:16.552Z

(Statistics of the local KSQL server interaction with the Kafka topic localhost.dbo.TradeHistory)

The Configuration of the localhost.dbo.TradeHistory Topic

/usr/bin/kafka-topics --zookeeper zookeeper:2181 --describe --topic localhost.dbo.TradeHistory
Topic:localhost.dbo.TradeHistory    PartitionCount:1    ReplicationFactor:1 Configs:min.cleanable.dirty.ratio=0.01,delete.retention.ms=100,cleanup.policy=compact,segment.ms=100
    Topic: localhost.dbo.TradeHistory   Partition: 0    Leader: 1   Replicas: 1 Isr: 1

In my test, I'm adding messages to the localhost.dbo.TradeHistory topic with TradeId 2 that simply change the amount of the trade. Only the Amount is updated; the AccountId remains 1.

The messages in the localhost.dbo.TradeHistory topic

/usr/bin/kafka-console-consumer --bootstrap-server broker:9092 --property print.key=true --topic localhost.dbo.TradeHistory --from-beginning

... (earlier values redacted) ...

2   {"TradeHistoryId":47,"TradeId":2,"AccountId":1,"Spn":1,"Amount":106.0,"__table":"TradeHistory"}
2   {"TradeHistoryId":48,"TradeId":2,"AccountId":1,"Spn":1,"Amount":107.0,"__table":"TradeHistory"}

The dump of the topic, above, shows the Amount of Trade 2 (in Account 1) changing from 106.0 to 107.0.

The KSQL Query

ksql> select AccountId, count(*) as Count, sum(Amount) as Total from Trades group by AccountId;
1 | 1 | 106.0
1 | 0 | 0.0
1 | 1 | 107.0

The question is: why does the KSQL query shown above return an "intermediate" value each time I publish a trade update? As you can see, the Count and Total fields show 0 and 0.0, and then the query immediately "corrects" them to 1 and 107.0. I'm a bit confused by this behavior.

Can anyone explain it?

Many thanks.

Thanks for your question. I've added an answer to our knowledge base: https://github.com/confluentinc/ksql/pull/3594/files.

When KSQL sees an update to an existing row in a table it internally emits a CDC event, which contains the old and new value. Aggregations handle this by first undoing the old value, before applying the new value.

So, in the example above, when the second insert happens, KSQL first undoes the old value. This causes the COUNT to go down by 1 and the SUM to go down by the old value of 106.0, i.e. down to zero. KSQL then applies the new row value, which sees the COUNT go up by 1 and the SUM go up by the new value, 107.0.
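The undo-then-apply sequence can be sketched as a plain Python simulation (illustrative only, not KSQL internals; the function and variable names are my own):

```python
# Simulate how an update to an existing table row is processed against an
# aggregate: first retract the old value, then apply the new value.

def apply_update(agg, old_amount, new_amount):
    """Yield the (count, total) aggregate state after each internal step."""
    if old_amount is not None:
        # Undo the old row: COUNT down by 1, SUM down by the old amount.
        agg = (agg[0] - 1, agg[1] - old_amount)
        yield agg          # the "intermediate" result
    # Apply the new row: COUNT up by 1, SUM up by the new amount.
    agg = (agg[0] + 1, agg[1] + new_amount)
    yield agg              # the final result

# Account 1 currently aggregates the single trade with Amount 106.0.
state = (1, 106.0)
results = list(apply_update(state, old_amount=106.0, new_amount=107.0))
print(results)  # [(0, 0.0), (1, 107.0)]
```

The first yielded tuple is exactly the `1 | 0 | 0.0` row seen in the query output; the second is the corrected `1 | 1 | 107.0` row.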

By default, KSQL is configured to buffer results for up to 2 seconds, or 10 MB of data, before flushing them to Kafka. This is why you may see a slight delay in the output when inserting values in this example. If both output rows are buffered together, KSQL suppresses the first result, which is why you often do not see the intermediate row being output. The configurations commit.interval.ms and cache.max.bytes.buffering, whose defaults are 2 seconds and 10 MB respectively, can be used to tune this behaviour. Setting either of these settings to zero will cause KSQL to always output all intermediate results.
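Why buffering suppresses the intermediate row can be sketched with another small simulation (again illustrative, not the actual Kafka Streams record cache):

```python
# Updates for the same key that land within one commit interval are
# buffered, and only the latest value per key is flushed downstream.

def flush_buffered(updates):
    """Keep only the last buffered value per key, as a commit-time flush would."""
    latest = {}
    for key, value in updates:   # updates arrive in order
        latest[key] = value      # a later value overwrites the earlier one
    return list(latest.items())

# Both internal updates for AccountId 1 arrive within the same interval:
buffered = [(1, (0, 0.0)), (1, (1, 107.0))]
flushed = flush_buffered(buffered)
print(flushed)  # [(1, (1, 107.0))] -- the intermediate row is dropped
```

With commit.interval.ms (or cache.max.bytes.buffering) set to zero, each update is flushed immediately instead of buffered, so both rows reach the output, which matches the behaviour described in the question.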

If you are seeing these intermediate results output every time, then it's likely you have set one, or both, of these settings to zero.

We have a GitHub issue to enhance KSQL to make use of Kafka Streams' suppression functionality, which would give users more control over how results are materialized.
