I have about 200k CSVs(all with same schema). I wrote a Cloud Function for them to insert them to BigQuery such that as soon as I copy the CSV to a bucket, the function is executed and data is loaded to the BigQuery dataset
I basically used the same code as in the documentation.
dataset_id = 'my_dataset' # replace with your dataset ID
table_id = 'my_table' # replace with your table ID
table_ref = bigquery_client.dataset(dataset_id).table(table_id)
table = bigquery_client.get_table(table_ref) # API request
def bigquery_csv(data, context):
job_config = bigquery.LoadJobConfig()
job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND
job_config.skip_leading_rows = 1
# The source format defaults to CSV, so the line below is optional.
job_config.source_format = bigquery.SourceFormat.CSV
uri = 'gs://{}/{}'.format(data['bucket'], data['name'])
errors = bigquery_client.load_table_from_uri(uri,
table_ref,
job_config=job_config) # API request
logging.info(errors)
#print('Starting job {}'.format(load_job.job_id))
# load_job.result() # Waits for table load to complete.
logging.info('Job finished.')
destination_table = bigquery_client.get_table(table_ref)
logging.info('Loaded {} rows.'.format(destination_table.num_rows))
However, when I copied all the CSVs to the bucket(which were about 43 TB), not all data was added to BigQuery and only about 500 GB was inserted.
I can't figure what's wrong. No insert jobs are being shown in Stackdriver Logging and no functions are running once the copy job is complete.
However, when I copied all the CSVs to the bucket(which were about 43 TB), not all data was added to BigQuery and only about 500 GB was inserted.
You are hitting BigQuery load limit as defined in this link
You should split your file into smaller file and the upload will work
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.