
Data stops pushing to BigQuery

I'm trying to load data from a PostgreSQL database into BigQuery using a cursor with fetchmany, but not all of the data gets pushed to BQ: only the first batch (1,000 rows) is inserted, and there is no error in the logs either.

If this code is run on my laptop, it works well. But it's a different story when I run it on GCP Composer.

The data only contains 8 columns: 4 integer columns with values ranging from 1 to 20M (like user_id), 2 string columns (like user_name, hash), and 2 date columns (created_date, dwh_created_date). The total is ~100k rows.

Below is my code. I already tried adding a sleep after each fetch, because I thought the load might need processing time, or that Google might impose a gap between API requests. The DataFrame contains the expected data, so I suspect something else is going on.

with cursor:
    cursor.execute(sql_query)
    while True:
        rows = cursor.fetchmany(1000)
        if not rows:
            break
        logger.info(f"rows :{len(rows)}")
        column_names = [desc[0] for desc in cursor.description]
        logger.info(f"Column name: {column_names}")
        df = pd.DataFrame(rows, columns=column_names)
        df.reset_index(drop=True, inplace=True)
        if schema_dict is not None and selected_column is not None:
            df = df[selected_column]
            df = convert_pandas_datatype(df, schema_dict)
        client.load_table_from_dataframe(
            df,
            table_id,
            job_config=job_config
        )
        # from time import sleep
        # sleep(5)
        # print("sleeping............")
conn.close()
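The helpers schema_dict, selected_column, and convert_pandas_datatype referenced above are not shown in the question. A plausible sketch of what they might look like, assuming the 8 columns described earlier (column names other than user_id, user_name, hash, created_date, and dwh_created_date are hypothetical):

```python
import pandas as pd

# Hypothetical column list matching the described data; order_id, item_id,
# and quantity are invented placeholders for the remaining int columns.
selected_column = [
    "user_id", "order_id", "item_id", "quantity",   # int columns
    "user_name", "hash",                            # string columns
    "created_date", "dwh_created_date",             # date columns
]

# Mapping of column name -> target pandas dtype.
schema_dict = {
    "user_id": "int64", "order_id": "int64",
    "item_id": "int64", "quantity": "int64",
    "user_name": "string", "hash": "string",
    "created_date": "datetime64[ns]", "dwh_created_date": "datetime64[ns]",
}

def convert_pandas_datatype(df: pd.DataFrame, schema: dict) -> pd.DataFrame:
    """Cast each column present in df to the dtype named in the schema."""
    out = df.copy()
    for col, dtype in schema.items():
        if col not in out.columns:
            continue
        if dtype.startswith("datetime"):
            out[col] = pd.to_datetime(out[col])  # parse date strings
        else:
            out[col] = out[col].astype(dtype)
    return out
```

Casting explicitly like this matters because load_table_from_dataframe infers the BigQuery schema from the pandas dtypes unless a schema is supplied in the job config.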

So how can I get all of the data into BigQuery?

According to the Google Cloud documentation, to wait for a job to complete you need to use the result() method. job.result() blocks until the job finishes, e.g. rows = query_job.result(). You can edit your code as below:

with cursor:
    cursor.execute(sql_query)
    while True:
        rows = cursor.fetchmany(1000)
        if not rows:
            break
        logger.info(f"rows :{len(rows)}")
        column_names = [desc[0] for desc in cursor.description]
        logger.info(f"Column name: {column_names}")
        df = pd.DataFrame(rows, columns=column_names)
        df.reset_index(drop=True, inplace=True)
        if schema_dict is not None and selected_column is not None:
            df = df[selected_column]
            df = convert_pandas_datatype(df, schema_dict)
        job = client.load_table_from_dataframe(
            df,
            table_id,
            job_config=job_config
        )
        job.result()

conn.close()

Never use the sleep() function for this: it suspends the current thread for a fixed number of seconds regardless of whether the load job has finished, so it either wastes time or doesn't wait long enough. Waiting on job.result() avoids both problems.
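The same submit-then-wait pattern exists in Python's own standard library: submitting work returns a future immediately, and only result() actually blocks until the work is done. A minimal stdlib illustration (slow_load here is a stand-in simulating a load job, not a BigQuery API):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_load(batch):
    """Simulate a load job that takes a little while to complete."""
    time.sleep(0.1)
    return len(batch)

with ThreadPoolExecutor() as pool:
    # submit() returns immediately; the work may not be done yet.
    future = pool.submit(slow_load, range(1000))
    # result() blocks until the job actually finishes, then returns its value.
    loaded = future.result()
```

A sleep(5) in place of result() would only happen to work when the job takes under five seconds; result() waits exactly as long as needed, and also re-raises any error the job hit, instead of silently discarding it.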

