简体   繁体   中英

BigQuery to Data Studio : Show reliable COUNT DISTINCT regardless of the selected period

in my BigQuery project I store event data integrated from Firebase. The granularity and dimension is such that trying to present raw data in Data Studio quickly makes the report become VERY slow (1-2 min per page/interaction).

I then started to think how I could create pre-aggregated tables in BigQuery to speed everything up, but quickly realised COUNT DISTINCT metrics would be a problem with this approach. Let me explain:

SELECT user, date
FROM UNNEST([
  STRUCT("Adam" AS user, "20190923" AS date),
  ("Bob", "20190923"),
  ("Carl", "20190923"),
  ("Adam", "20190924"),
  ("Bob", "20190924"),
  ("Adam", "20190925"),
  ("Carl", "20190925"),
  ("Bob", "20190926")
]) AS website_visits;

+------+----------+
| User |   Date   |
+------+----------+
| Adam | 20190923 |
| Bob  | 20190923 |
| Carl | 20190923 |
| Adam | 20190924 |
| Bob  | 20190924 |
| Adam | 20190925 |
| Carl | 20190925 |
| Bob  | 20190926 |
+------+----------+

The above is a table of website visits.

Clearly, creating a pre-aggregated table like

SELECT date, COUNT(DISTINCT user) FROM website_visits GROUP BY date

has the limitation that the count cannot be aggregated further (or even less, dinamically) to get a total, as doing a SUM would return 8 unique users which is not correct, there are only 3 unique users.

In BigQuery, this is fixed by using HLL_COUNT, which despite the approximation works ok for me.

Now to the big question:

How to do the same so that the result is displayable in Data Studio????

HLL_COUNT.EXTRACT is not available as function in there, and in the reporting I always have to keep in mind that the date range is set by the user however (s)he likes so it's not possible to store a pre-aggregated result for ALL cases...

EDIT 1: APPROX_COUNT_DISTINCT

As per answer from Bobbylank, I tried to use APPROX_COUNT_DISTINCT. However I found that this just seems to move the issue down the line. My fault for not explaining what's over there. Despite being performances acceptable it does not seem possible to me to blend a data source with this calculated metric.

Example: After displaying the amount of unique users in the selected period ( which now works ), I'm also trying to display Average Revenue Per User (ARPU) in Data Studio like Firebase does.

To do this, I have to SUM(REVENUE) / APPROX_COUNT_DISTINCT(USER)

Clearly, REVENUE works ok with pre-aggregation and is available in the raw data. I tried then to blend the raw data with a table containing just user visits. However APPROX_COUNT_DISTINCT can't be used in the blended data definition as calculated metrics are not allowed.

Even trying to use the USER field as a metric with Count Distinct aggregation, despite returning the correct figures when showing revenue and user count separately, when I try to divide them the problem becomes aggregation (apply SUM or AVG to the field and basically the result will be AVG(REVENUE/USERS) for each day).

I also then tried to store REVENUE directly in the visits table, but was reminded by Data Studio that I can't create calculated metrics that I can't mix dimensions and metrics in a calculated field.

APPROX_COUNT_DISTINCT might be more performance friendly for you?

https://support.google.com/datastudio/answer/9189108?hl=en

Otherwise the only way I can think would be to pre-calculate several metrics (eg unique users on that day, 7-day cumulative, 14-day, etc.) as your customer require for each single day.

Or you could provide a 2 page report with both of these methods with the caveat that the first can be used over a time period but will be much slower?

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM