Firestore BigQuery extension - performance

I am using the Firestore BigQuery extension to stream data to Google BigQuery. The data is stored in JSON format, so I think the best practice is to generate schema views with this library.

When I run my BI tool on those views to aggregate and filter data, the resulting BigQuery queries perform very poorly.

Is there a better way to do this? I was thinking of using materialized views, but the schema view script already builds views on top of other views, and you can't do that with materialized views. I think I am missing something in my setup, because I am only talking about collections with a few thousand records.

EDIT: A real example from my prod environment

Raw data: table transaction_raw_changelog, coming directly from the Firestore extension

Generating schema views creates 2 views

  • transaction_raw_latest
-- Retrieves the latest document change events for all live documents.
--   timestamp: The Firestore timestamp at which the event took place.
--   operation: One of INSERT, UPDATE, DELETE, IMPORT.
--   event_id: The id of the event that triggered the cloud function mirrored the event.
--   data: A raw JSON payload of the current state of the document.
--   document_id: The document id as defined in the Firestore database
SELECT
  document_name,
  document_id,
  timestamp,
  event_id,
  operation,
  data
FROM
  (
    SELECT
      document_name,
      document_id,
      FIRST_VALUE(timestamp) OVER(
        PARTITION BY document_name
        ORDER BY
          timestamp DESC
      ) AS timestamp,
      FIRST_VALUE(event_id) OVER(
        PARTITION BY document_name
        ORDER BY
          timestamp DESC
      ) AS event_id,
      FIRST_VALUE(operation) OVER(
        PARTITION BY document_name
        ORDER BY
          timestamp DESC
      ) AS operation,
      FIRST_VALUE(data) OVER(
        PARTITION BY document_name
        ORDER BY
          timestamp DESC
      ) AS data,
      FIRST_VALUE(operation) OVER(
        PARTITION BY document_name
        ORDER BY
          timestamp DESC
      ) = "DELETE" AS is_deleted
    FROM
      `swipedrinks-app.transaction.transaction_raw_changelog`
    ORDER BY
      document_name,
      timestamp DESC
  )
WHERE
  NOT is_deleted
GROUP BY
  document_name,
  document_id,
  timestamp,
  event_id,
  operation,
  data
  • transaction_schema_transaction_schema_latest
-- Given a user-defined schema over a raw JSON changelog, returns the
-- schema elements of the latest set of live documents in the collection.
--   timestamp: The Firestore timestamp at which the event took place.
--   operation: One of INSERT, UPDATE, DELETE, IMPORT.
--   event_id: The event that wrote this row.
--   <schema-fields>: This can be one, many, or no typed-columns
--                    corresponding to fields defined in the schema.
SELECT
  *
EXCEPT
  (orderitem)
FROM
  (
    SELECT
      document_name,
      document_id,
      timestamp,
      operation,
      amount,
      bartenderId,
      eventStandId,
      event_id,
      paymentMethod,
      type,
      orderitem,
      toUserId
    FROM
      (
        SELECT
          document_name,
          document_id,
          FIRST_VALUE(timestamp) OVER(
            PARTITION BY document_name
            ORDER BY
              timestamp DESC
          ) AS timestamp,
          FIRST_VALUE(operation) OVER(
            PARTITION BY document_name
            ORDER BY
              timestamp DESC
          ) AS operation,
          FIRST_VALUE(operation) OVER(
            PARTITION BY document_name
            ORDER BY
              timestamp DESC
          ) = "DELETE" AS is_deleted,
          `swipedrinks-app.transaction.firestoreNumber`(
            FIRST_VALUE(JSON_EXTRACT_SCALAR(data, '$.amount')) OVER(
              PARTITION BY document_name
              ORDER BY
                timestamp DESC
            )
          ) AS amount,
          FIRST_VALUE(JSON_EXTRACT_SCALAR(data, '$.bartenderId')) OVER(
            PARTITION BY document_name
            ORDER BY
              timestamp DESC
          ) AS bartenderId,
          FIRST_VALUE(JSON_EXTRACT_SCALAR(data, '$.eventStandId')) OVER(
            PARTITION BY document_name
            ORDER BY
              timestamp DESC
          ) AS eventStandId,
          FIRST_VALUE(JSON_EXTRACT_SCALAR(data, '$.event_id')) OVER(
            PARTITION BY document_name
            ORDER BY
              timestamp DESC
          ) AS event_id,
          `swipedrinks-app.transaction.firestoreNumber`(
            FIRST_VALUE(JSON_EXTRACT_SCALAR(data, '$.paymentMethod')) OVER(
              PARTITION BY document_name
              ORDER BY
                timestamp DESC
            )
          ) AS paymentMethod,
          FIRST_VALUE(JSON_EXTRACT_SCALAR(data, '$.type')) OVER(
            PARTITION BY document_name
            ORDER BY
              timestamp DESC
          ) AS type,
          `swipedrinks-app.transaction.firestoreArray`(
            FIRST_VALUE(JSON_EXTRACT(data, '$.order')) OVER(
              PARTITION BY document_name
              ORDER BY
                timestamp DESC
            )
          ) AS orderitem,
          FIRST_VALUE(JSON_EXTRACT_SCALAR(data, '$.toUserId')) OVER(
            PARTITION BY document_name
            ORDER BY
              timestamp DESC
          ) AS toUserId
        FROM
          `swipedrinks-app.transaction.transaction_raw_latest`
      )
    WHERE
      NOT is_deleted
  ) transaction_raw_latest
  LEFT JOIN UNNEST(transaction_raw_latest.orderitem) AS orderitem_member WITH OFFSET orderitem_index
GROUP BY
  document_name,
  document_id,
  timestamp,
  operation,
  amount,
  bartenderId,
  eventStandId,
  event_id,
  paymentMethod,
  type,
  toUserId,
  orderitem_index,
  orderitem_member

The view transaction_schema_transaction_schema_latest is the one that is easy to query, with all my latest data and one column per document property.

I'd like to query for the sum of amounts per transaction per event_id

SELECT sum(amount) FROM `swipedrinks-app.transaction.transaction_schema_transaction_schema_latest`

This query takes around 12 s, and I have 196697 rows in this table.
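
Note that the query above returns a single grand total rather than one sum per event_id. A grouped variant could look like the sketch below. It assumes, from reading the view definition, that the LEFT JOIN UNNEST over orderitem makes the view emit one row per order item, so each document is collapsed to a single row before summing. The alias total_amount is illustrative.

```sql
-- Sketch, not a tuned query: sum of amount per event_id.
-- Assumption: the schema view repeats a document once per order item
-- (because of LEFT JOIN UNNEST), so duplicates are collapsed first.
SELECT
  event_id,
  SUM(amount) AS total_amount  -- illustrative alias
FROM
  (
    SELECT DISTINCT
      document_name,
      event_id,
      amount
    FROM
      `swipedrinks-app.transaction.transaction_schema_transaction_schema_latest`
  )
GROUP BY
  event_id
```

If every order holds exactly one item, the inner SELECT DISTINCT changes nothing and can be dropped.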


This may help:

  1. Try to use the standard JSON extraction functions, such as JSON_VALUE, instead of the legacy JSON extraction functions, such as JSON_EXTRACT_SCALAR. As the docs say:

While these functions are supported by Google Standard SQL, we recommend using the functions in the previous table.

  2. Try BigQuery BI Engine; in some cases, enabling this optimization helps reduce BigQuery's response time. As far as I know, a 12 s response in BigQuery is not an issue in itself; what you have to check is how this time scales as your data grows. BigQuery may not be the best fit for your solution; have you considered Cloud Bigtable? Take a look at this material.
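
As a minimal sketch of the first suggestion, the query below reads the two fields needed for the sum straight from transaction_raw_latest with JSON_VALUE. It assumes data holds the JSON payload as shown in the question. SAFE_CAST to NUMERIC is only a stand-in for the firestoreNumber UDF, whose definition is not shown, and the aliases payload_event_id and total_amount are illustrative.

```sql
-- Sketch only: JSON_VALUE in place of JSON_EXTRACT_SCALAR.
-- transaction_raw_latest already keeps one live row per document,
-- so no further window functions are applied here.
SELECT
  -- Distinct alias, because the view also has its own event_id column.
  JSON_VALUE(data, '$.event_id') AS payload_event_id,
  -- Assumption: SAFE_CAST stands in for the firestoreNumber UDF.
  SUM(SAFE_CAST(JSON_VALUE(data, '$.amount') AS NUMERIC)) AS total_amount
FROM
  `swipedrinks-app.transaction.transaction_raw_latest`
GROUP BY
  payload_event_id
```

Whether this is noticeably faster than the generated schema view would need to be checked against the actual data, for example by comparing the two query plans.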
