
Data stops pushing to BigQuery

I'm trying to load data from a PostgreSQL database into BigQuery using fetchmany on a cursor, but not all of the data gets pushed to BQ: only the first batch (1,000 rows) is inserted, and there is no error in the log either.

When this code runs on my laptop, it works fine. But it's a different story when I run it in GCP Composer.

The data contains only 8 columns: 4 int columns with values ranging from 1 to 20M (such as user_id), 2 string columns (such as user_name and hash), and 2 date columns (created_date and dwh_created_date). The total row count is about 100k.
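For reference, a BigQuery schema matching the columns described above might look like the following (a hypothetical sketch; the actual schema_dict used below is not shown here, and only the named columns are included):

from google.cloud import bigquery

# Hypothetical schema for the columns described above.
schema = [
    bigquery.SchemaField("user_id", "INTEGER"),
    bigquery.SchemaField("user_name", "STRING"),
    bigquery.SchemaField("hash", "STRING"),
    bigquery.SchemaField("created_date", "DATE"),
    bigquery.SchemaField("dwh_created_date", "DATE"),
]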

Below is my code. I already tried adding a sleep after each fetch, because I thought the load might need time to process, and maybe Google throttles API requests. The data frame contains enough data, so I suspected something else was going on.

with cursor:
    cursor.execute(sql_query)
    while True:
        rows = cursor.fetchmany(1000)
        if not rows:
            break
        logger.info(f"rows :{len(rows)}")
        column_names = [desc[0] for desc in cursor.description]
        logger.info(f"Column name: {column_names}")
        df = pd.DataFrame(rows, columns=column_names)
        df.reset_index(drop=True, inplace=True)
        if schema_dict is not None and selected_column is not None:
            df = df[selected_column]
            df = convert_pandas_datatype(df, schema_dict)
        client.load_table_from_dataframe(
            df,
            table_id,
            job_config=job_config
        )
        # from time import sleep
        # sleep(5)
        # print("sleeping............")
conn.close()
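For completeness, the snippet assumes a live PostgreSQL connection, a BigQuery client, and a job_config created elsewhere. A minimal sketch of that setup, with hypothetical connection details and table name, might be:

import logging

import pandas as pd
import psycopg2
from google.cloud import bigquery

logger = logging.getLogger(__name__)

# Hypothetical connection details; the real ones are not shown in the question.
conn = psycopg2.connect(host="localhost", dbname="source_db", user="etl_user", password="secret")
cursor = conn.cursor()

client = bigquery.Client()
table_id = "my-project.my_dataset.target_table"  # hypothetical target table

# Append each batch to the target table.
job_config = bigquery.LoadJobConfig(
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)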

So how can I get all of the data loaded into BigQuery?

According to the Google Cloud documentation, to wait for a job to complete you need to use the result() function: job.result() blocks until the job finishes. Example: rows = query_job.result(). You can edit your code as below:

with cursor:
    cursor.execute(sql_query)
    while True:
        rows = cursor.fetchmany(1000)
        if not rows:
            break
        logger.info(f"rows :{len(rows)}")
        column_names = [desc[0] for desc in cursor.description]
        logger.info(f"Column name: {column_names}")
        df = pd.DataFrame(rows, columns=column_names)
        df.reset_index(drop=True, inplace=True)
        if schema_dict is not None and selected_column is not None:
            df = df[selected_column]
            df = convert_pandas_datatype(df, schema_dict)
        job = client.load_table_from_dataframe(
            df,
            table_id,
            job_config=job_config
        )
        # Block until this batch's load job finishes before fetching the next one.
        job.result()
       
conn.close()
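A variation worth noting (a sketch, not part of the original answer): calling job.result() inside the loop runs the batches strictly one after another. If you'd rather let the load jobs overlap on the BigQuery side, you can start them all and wait once at the end:

jobs = []
with cursor:
    cursor.execute(sql_query)
    while True:
        rows = cursor.fetchmany(1000)
        if not rows:
            break
        df = pd.DataFrame(rows, columns=[desc[0] for desc in cursor.description])
        # Start the load job without blocking on it.
        jobs.append(client.load_table_from_dataframe(df, table_id, job_config=job_config))
# Wait for every pending load job before closing the connection.
for job in jobs:
    job.result()
conn.close()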

Never use the sleep() function unnecessarily: it suspends the execution of the current thread for the given number of seconds and only creates unnecessary problems here. For more information, you can follow this link.
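If you need an upper bound on how long the pipeline blocks, result() accepts a timeout argument, which is a better fit than an unconditional sleep. A minimal sketch (the 300-second value is an assumption):

import concurrent.futures

from google.cloud.exceptions import GoogleCloudError

try:
    # Wait at most 300 seconds for the load job instead of sleeping blindly.
    job.result(timeout=300)
except concurrent.futures.TimeoutError:
    logger.error(f"Load job {job.job_id} did not finish within 300s")
    raise
except GoogleCloudError as exc:
    logger.error(f"Load job {job.job_id} failed: {exc}")
    raise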
