Kusto / Azure Data Explorer - Distinct count in Kusto queries

I'm using Application Insights with a customEvent and need to get the number of events with a distinct field.

The event looks something like this:

{
    "statusCode" : 200,
    "some_field": "ABC123QWERTY"
}

I want the number of unique some_field values among events with statusCode 200. I've looked at this question and tried a couple of different queries, some of which give different answers. In SQL it would look something like this:

SELECT COUNT(DISTINCT my_field) AS Count
FROM customEvents
WHERE statusCode=200

Which one is correct?

1 - dcount with default accuracy

customEvents
| where (customDimensions.statusCode == 200) and timestamp >= startofday(datetime(2022-12-01T00:00:00Z)) and timestamp <= endofday(datetime(2022-12-31T00:00:00Z))
| summarize dcount(tostring(customDimensions.some_field))

17,853 items


2 - Count by my_field and count number of rows

customEvents
| extend my_field = tostring(customDimensions.some_field)
| where customDimensions.statusCode == 200 and timestamp >= startofday(datetime(2022-12-01T00:00:00Z)) and timestamp <= endofday(datetime(2022-12-31T00:00:00Z))
| summarize Count = count() by my_field

17,774 items.


3 - summarize with by some_field

customEvents
| extend StatusCode = tostring(customDimensions["statusCode"]), MyField = tostring(customDimensions["some_field"])
| where timestamp >= startofday(datetime(2022-12-01T00:00:00Z)) and timestamp <= endofday(datetime(2022-12-31T00:00:00Z))
| summarize any(StatusCode) by MyField
| summarize Count = count() by any_StatusCode

17,626 items.


4 - dcount with higher accuracy?

customEvents
| where (customDimensions.statusCode == 200) and timestamp >= startofday(datetime(2022-12-01T00:00:00Z)) and timestamp <= endofday(datetime(2022-12-31T00:00:00Z))
| summarize dcount(tostring(customDimensions.some_field),4)

17,736 items


5 - count_distinct from preview

customEvents
| where (customDimensions.statusCode == 200) and timestamp >= startofday(datetime(2022-12-01T00:00:00Z)) and timestamp <= endofday(datetime(2022-12-31T00:00:00Z))
| summarize count_distinct(tostring(customDimensions.some_field))

17,744 items

The documentation on learn.microsoft.com states:

Use dcount and dcountif to count distinct values in a specific column.

And dcount-aggfunction mentions the accuracy:

Returns an estimate of the number of distinct values of expr in the group.

count_distinct seems to be the correct way:

Counts unique values specified by the scalar expression per summary group, or the total number of unique values if the summary group is omitted.

count_distinct() is a new KQL function that returns an accurate result.

dcount() returns an approximate result.
It can be used with a 2nd argument, a constant integer with value 0, 1, 2, 3 or 4 (0 = fast, 1 = default, 2 = accurate, 3 = extra accurate, 4 = super accurate).

  1. In your examples (specifically "4 - dcount with higher accuracy?") you have not used a 2nd argument.
  2. Higher accuracy means higher accuracy statistically.
    It means that the error is bounded by a lower value.
    Theoretically (and in practice), dcount() with lower accuracy may in some scenarios yield a result that is closer to the real number than dcount() with higher accuracy.
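To see the difference between the exact and the estimated counts on data where the true answer is known, here is a minimal sketch on synthetic rows. The column names i and key and the value 20000 are made up for illustration; nothing here comes from the original customEvents data.

```kusto
// Synthetic data: 1,000,000 rows whose key takes exactly 20,000 distinct values.
range i from 1 to 1000000 step 1
| extend key = tostring(i % 20000)
| summarize
    exact        = count_distinct(key),  // exact distinct count
    approx_fast  = dcount(key, 0),       // lowest accuracy, fastest
    approx_deflt = dcount(key),          // same as dcount(key, 1)
    approx_best  = dcount(key, 4)        // highest accuracy, slowest
```

The exact column is the true distinct count; the dcount() columns are estimates that may land slightly above or below it, and a lower accuracy level is not guaranteed to be further off than a higher one on any single run.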

Having said that -

I would guess that you executed your queries with a UI filter of last 24 hours or something similar.
This means that each execution ran over a different timespan.
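One way to rule out a moving time window is to fix the range inside the query and compute all variants in a single summarize, so every number comes from the same rows. The following is a sketch, not a verified query: the names startTime and endTime are introduced here, and toint() assumes statusCode is stored as a number or a numeric string.

```kusto
// Pin the time window in the query so it does not depend on the UI time picker.
let startTime = startofday(datetime(2022-12-01));
let endTime   = endofday(datetime(2022-12-31));
customEvents
| where timestamp between (startTime .. endTime)
| where toint(customDimensions.statusCode) == 200   // assumption: numeric status code
| extend some_field = tostring(customDimensions.some_field)
| summarize
    exact       = count_distinct(some_field),  // exact distinct count
    approx      = dcount(some_field),          // estimate, default accuracy
    approx_best = dcount(some_field, 4)        // estimate, highest accuracy
```

Because all three aggregates run over the same filtered rows, any remaining difference between them comes from the dcount() estimation error rather than from the data changing between executions.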
