
Left join workaround when creating a materialized view in BigQuery

I have two tables under the dataset my_dataset in BigQuery:

  1. events table

[screenshot of the events table schema]

  2. event_attributes table

[screenshot of the event_attributes table schema]

Data under these tables is populated via Dataflow job running in GCP.

In the event_attributes table there are 4 distinct values of attribute_number: [1, 2, 3, 4].

For a given events_fk value, there can be at most 4 rows in the event_attributes table, i.e. the value of all_count in the query below is at most 4.

select events_fk, count(*) as all_count from my_dataset.event_attributes group by events_fk order by all_count desc limit 1

Now, I want to create a materialized view that has the count of all events and the count of completed events (from the events table), grouped by time, app, completed, and attribute_value (from the event_attributes table). For this, I'm trying the query below:

create materialized view my_dataset.test_mv as (
  select 
  extract(DATE FROM timestamp_seconds(epoch_seconds_utc)) as date,
  extract(HOUR FROM timestamp_seconds(epoch_seconds_utc)) as hour,
  app,
  completed,
  attribute_value,
  count(*) as all_count,
  sum(cast(completed as int64)) as completed_count
  from `my_dataset.events` c
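  -- workaround attempt: aggregate the event_attributes rows into an array and UNNEST it,
  -- since a direct LEFT JOIN is not currently supported in materialized views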
  left join unnest(
    (
      select array_agg( struct(id, events_fk, attribute_number, attribute_value) ) as eattr
      from `my_dataset.event_attributes`
    )
  ) a
  on c.id = a.events_fk
  group by date, hour, app, completed, attribute_value
)

Since a left join is not currently supported when creating a materialized view, I was trying to use unnest() with an array argument as a workaround, but BigQuery complains with the error below:

[screenshot of the BigQuery error message]

I'm guessing it's because of the subquery that I'm using under unnest().
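For reference, here is a sketch of the aggregation I'm after, written as a plain left join. This runs fine as an ordinary query (or a logical view); it is only the materialized view that rejects it:

select
  extract(DATE FROM timestamp_seconds(e.epoch_seconds_utc)) as date,
  extract(HOUR FROM timestamp_seconds(e.epoch_seconds_utc)) as hour,
  e.app,
  e.completed,
  ea.attribute_value,
  count(*) as all_count,
  sum(cast(e.completed as int64)) as completed_count
from `my_dataset.events` e
left join `my_dataset.event_attributes` ea on e.id = ea.events_fk
group by date, hour, app, completed, attribute_value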

Reason for trying to create a materialized view: I want to use it for constructing different reports.

My requirements are:

  1. Combine the data from the two different tables and use the resulting table to create the materialized view.
  2. Have all rows of the events table considered in the final materialized view calculation (which can't be done with an inner join).

I've also thought of combining the data of these two tables in the Dataflow stage itself, but that would be a longer route for now, since I would have to work out the changes to the Dataflow job.

Question: I need help with the create materialized view query. If such a query is not possible, what would be the best course of action in this situation? Any suggestions would be very helpful.

Thanks in advance!!

Edit: Sample data can be constructed via the queries below:

create table my_dataset.events (
  id INTEGER,
  epoch_seconds_utc INTEGER,
  app STRING,
  engaged BOOLEAN,
  completed BOOLEAN
);

create table my_dataset.event_attributes (
  id INTEGER,
  events_fk INTEGER,
  attribute_number STRING,
  attribute_value STRING
);


-- POPULATE EVENTS DATA
insert into `my_dataset.events` values (435515, 1660807131, 'airtel', FALSE, FALSE);
insert into `my_dataset.events` values (435517, 1660807514, 'vodafone', FALSE, FALSE);
insert into `my_dataset.events` values (285111, 1607927586, 'telstra', FALSE, FALSE);
insert into `my_dataset.events` values (435051, 1649137598, 'healthways', FALSE, FALSE);
insert into `my_dataset.events` values (435017, 1649068151, 'tangoe', FALSE, FALSE);
insert into `my_dataset.events` values (435519, 1660821334, 'airtel', FALSE, FALSE);
insert into `my_dataset.events` values (703, 1589840471,    'vodafone', FALSE, FALSE);
insert into `my_dataset.events` values (1619, 1591014513, 'telstra', FALSE, FALSE);
insert into `my_dataset.events` values (783, 1589840537, 'tangoe', FALSE, FALSE);
insert into `my_dataset.events` values (20245, 1596825490, 'healthways', FALSE, FALSE);


-- POPULATE EVENT ATTRIBUTES DATA
insert into `my_dataset.event_attributes` values (6341, 435017, '1',    'In Progress');
insert into `my_dataset.event_attributes` values (6381, 435051, '1',    'In Progress');
insert into `my_dataset.event_attributes` values (5429, 285111, '1',    '9191919191');
insert into `my_dataset.event_attributes` values (5431, 285111, '2',    'Direct Connection');
insert into `my_dataset.event_attributes` values (5433, 285111, '3',    'English');
insert into `my_dataset.event_attributes` values (5435, 285111, '4',    'Covid');
insert into `my_dataset.event_attributes` values (7161, 435517, '3',    'English');
insert into `my_dataset.event_attributes` values (7159, 435517, '2',    'Some value');
insert into `my_dataset.event_attributes` values (7151, 435515, '2', 'Some value');
insert into `my_dataset.event_attributes` values (7163, 435517, '4',    'Default');
insert into `my_dataset.event_attributes` values (7157, 435517, '1',    '50000');
insert into `my_dataset.event_attributes` values (7149, 435515, '1',    '50000');
insert into `my_dataset.event_attributes` values (7153, 435515, '3',    'English');
insert into `my_dataset.event_attributes` values (7155, 435515, '4',    'Default');

With this data, if you perform an inner join using a query like

select e.id, ea.attribute_number, ea.attribute_value

from my_dataset.events e inner join my_dataset.event_attributes ea on e.id = ea.events_fk

I get 14 rows, and if I use a left join in the above query, I get 19 rows, because there are records in the events table that don't have a corresponding entry in event_attributes.

Desired output: all rows of the events table should be considered when creating the materialized view.

Let me know if any information is needed from my end.

As consulted with a Googler:

A materialized view is a good fit when the sources are written to less often than they are read, when having pre-processed data saves time or cost, or when an incremental refresh would be beneficial. [1]

With left join not supported, you may consider the following:

  • Denormalizing the tables into one table: combining the two source tables into one would remove the need for join statements (a sketch follows this list).
  • Using a scheduled query to create a new table from the query intended for the materialized view (also sketched below): while this option doesn't re-run the query every time data is read from the source, it could result in data staleness.
  • Recreating the table produced by the materialized view's query inside the Dataflow job: while this would not lead to data staleness, it does not support incremental refresh and may differ in performance.
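For illustration, a minimal sketch of the first option, assuming the Dataflow job writes a single pre-joined table (hypothetically named my_dataset.events_denormalized here) carrying the events columns plus attribute_value, with one row per event/attribute pair and a NULL attribute_value row for events that have no attributes. The materialized view then needs no join at all:

create materialized view my_dataset.test_mv as (
  select
  extract(DATE FROM timestamp_seconds(epoch_seconds_utc)) as date,
  extract(HOUR FROM timestamp_seconds(epoch_seconds_utc)) as hour,
  app,
  completed,
  attribute_value,
  count(*) as all_count,
  sum(cast(completed as int64)) as completed_count
  -- hypothetical pre-joined table maintained by the Dataflow job
  from `my_dataset.events_denormalized`
  group by date, hour, app, completed, attribute_value
)

For the second option, the left-join query sketched earlier in the question could be wrapped in a create or replace table statement (targeting a made-up table name such as my_dataset.event_report) and run as a scheduled query, accepting that the result is only as fresh as the last scheduled run.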

For comparison of other resource types, refer to [2].

[1] https://cloud.google.com/bigquery/docs/materialized-views-intro#use_cases

[2] https://cloud.google.com/bigquery/docs/materialized-views-intro#comparison
