How to fix scheduled query in BigQuery not returning results correctly

I've written a few queries for BigQuery in standard SQL that are scheduled to run every day at 5 AM UTC+2, but they return null or empty results.

Specifically, the queries look at the previous day's data, so they take the @run_date parameter and subtract 1 day to get the date used as the _table_suffix (the data is column partitioned).

I've tested running the queries manually, both from the console (swapping @run_date for CURRENT_DATE()) and by manually scheduling a backfill with the current date as the run date.

WITH installs AS (
    SELECT 
        COUNTIF(event_name="first_open" AND app_info.install_source="manual_install") AS manualInstalls,
        COUNTIF(event_name="first_open" AND app_info.install_source <> "manual_install") AS appStoreOtherInstalls,
        COUNTIF(event_name="first_open") AS installsTotal,
        COUNTIF(event_name="app_remove" AND user_id IS NOT NULL) AS uninstallRegistered,
        COUNTIF(event_name="app_remove" AND user_id IS NULL) AS uninstallUnregistered,
        COUNTIF(event_name="app_remove") AS uninstallsTotal
    FROM `project.dataset.events_*`
    WHERE _table_suffix=CAST(FORMAT_DATE("%Y%m%d",DATE_SUB(@run_date, INTERVAL 1 DAY)) AS STRING)
),

maus AS (
    SELECT 

    COUNT( DISTINCT
        (SELECT 
            IF(
                REGEXP_CONTAINS(user_id, 
                CASE
                    WHEN user_id LIKE "+237%" THEN r"\+237((67)|(8[0-3])|(5[0-4]))"
                    WHEN user_id LIKE "+27%" THEN r"\+27((83)|(810)|(6((0[3-5])|(3([0-5]|[8-9]))|(40)|(5[5-7])))|7(1(0|[7-9])|(3|8)))"
                    WHEN user_id LIKE "+225%" THEN r"\+225(0|[4-8])[4-6]"
                    WHEN user_id LIKE "+242%" THEN r"\+24206[4-9]"
                END
                ),
            user_id, "")
        )
    ) AS mauNetwork,

    count(DISTINCT user_id) AS mau
    FROM `project.dataset.events_*`
    WHERE _table_suffix BETWEEN CAST(FORMAT_DATE("%Y%m%d",DATE_SUB(@run_date, INTERVAL 30 DAY)) AS STRING) 
    AND CAST(FORMAT_DATE("%Y%m%d",DATE_SUB(@run_date, INTERVAL 1 DAY)) AS STRING)
)

SELECT 
    CAST(FORMAT_DATE("%Y%m%d", DATE_SUB(@run_date, INTERVAL 1 DAY)) AS STRING) AS event_date, 
    IF(installs.manualInstalls IS NOT NULL, installs.manualInstalls, 0) AS manualInstalls, 
    IF(installs.appStoreOtherInstalls IS NOT NULL, installs.appStoreOtherInstalls, 0) AS appStoreOtherInstalls,
    IF(installs.installsTotal IS NOT NULL, installs.installsTotal, 0) AS installsTotal,
    IF(installs.uninstallRegistered IS NOT NULL, installs.uninstallRegistered, 0) AS uninstallRegistered, 
    IF(installs.uninstallUnregistered IS NOT NULL, installs.uninstallUnregistered, 0) AS uninstallNotRegistered, 
    IF(installs.uninstallsTotal IS NOT NULL, installs.uninstallsTotal, 0) AS uninstallsTotal,
    IF(maus.mau IS NOT NULL, maus.mau, 0) AS mau,
    IF(maus.mauNetwork IS NOT NULL, maus.mauNetwork, 0) AS mauNetwork
FROM installs, signUps, daus, maus  -- the signUps and daus CTEs are not shown in this excerpt

I expected all columns to be populated, but they're zero-filled since they're null. The subquery block called maus does actually return results, but it has a much longer window. I suspect there may be something wrong with my logic for retrieving the previous day's data.

I'd really appreciate any insight or suggestions!

The underlying issue was that the tables were simply not available at 5 AM. There are two factors to consider when dealing with Firebase exports to BigQuery.

Firebase stores data for the current day in an events_intraday_* table, and when it decides to close off that table, the data is written to a permanent events_* table. This process can take some time and may take a full day, though that is rare.

Second, depending on the timezone configured in Firebase itself, the tables can close even later in the day.
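To see which of the two tables exists for the previous day at the moment the schedule fires, a lookup along these lines may help. It is a minimal sketch, assuming the same `project.dataset` placeholder as the question and relying on BigQuery's per-dataset INFORMATION_SCHEMA.TABLES view; it is not part of the original queries.

```sql
-- Sketch only: `project.dataset` is the placeholder from the question.
-- Returns one row for each table that exists for yesterday's date:
-- the finalized daily table, the intraday table, or both.
SELECT
    table_name,
    creation_time
FROM `project.dataset.INFORMATION_SCHEMA.TABLES`
WHERE table_name IN (
    CONCAT("events_", FORMAT_DATE("%Y%m%d", DATE_SUB(@run_date, INTERVAL 1 DAY))),
    CONCAT("events_intraday_", FORMAT_DATE("%Y%m%d", DATE_SUB(@run_date, INTERVAL 1 DAY)))
)
```

If only the events_intraday_ row comes back, the daily table had not been finalized yet when the query ran.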

Basically, run the queries later and everything works as expected. Unfortunately, this sometimes still fails because processing runs much later than anticipated, so there are two options: write a custom scheduler with internal logic for checking the availability of tables, or write the queries so that they check for data availability themselves first.
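The second option could look roughly like the following. This is a sketch under assumptions rather than the query actually used: it is trimmed to two of the question's metrics, and rowsFound is a hypothetical helper column added only to detect whether any rows were read for the target day.

```sql
-- Sketch only: trimmed to two metrics from the question.
-- rowsFound is a hypothetical helper column, not part of the original query.
WITH installs AS (
    SELECT
        COUNT(*) AS rowsFound,
        COUNTIF(event_name="first_open") AS installsTotal,
        COUNTIF(event_name="app_remove") AS uninstallsTotal
    FROM `project.dataset.events_*`
    WHERE _table_suffix=FORMAT_DATE("%Y%m%d", DATE_SUB(@run_date, INTERVAL 1 DAY))
)

SELECT
    FORMAT_DATE("%Y%m%d", DATE_SUB(@run_date, INTERVAL 1 DAY)) AS event_date,
    installs.installsTotal,
    installs.uninstallsTotal
FROM installs
-- Zero rows read most likely means the daily table has not been written yet,
-- so return nothing instead of a misleading row of zeros.
WHERE installs.rowsFound > 0
```

With this guard, a run that fires before the daily table exists produces no output row, and the missing day can be filled in later with a manual backfill. Note that the events_* wildcard also matches the events_intraday_ tables, whose suffix starts with "intraday_", so the equality filter on the plain date suffix leaves them out.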
