
BigQuery won't skip header row of CSV when loading a table with Python

I am using Python 3.8 to load a CSV file into BigQuery as a new table. I have the schema defined, autodetect off, and skip_leading_rows = 1. When I run the script, I get the following error:

BadRequest: 400 Error while reading data, error message: Could not parse 'Dollar Sales' as DOUBLE for field Dollar_Sales (position 13) starting at location 3913004 with message 'Unable to parse'

My code looks like this:

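from google.cloud import bigquery

client = bigquery.Client()

# dataset_id, table_id and localfilename are defined elsewhere in the script
dataset_ref = client.dataset(dataset_id)
table_ref = dataset_ref.table(table_id)
job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.CSV
job_config.skip_leading_rows = 1  # skip the header row
job_config.autodetect = False  # the schema is given explicitly below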
job_config.schema = [
     bigquery.SchemaField("Time", "STRING"),
     bigquery.SchemaField("product", "STRING"),
     bigquery.SchemaField("UPC_13_digit", "STRING"),
     bigquery.SchemaField("Brand_Name", "STRING"),
     bigquery.SchemaField("TGT_Private_Label_National_Brand_Value", "STRING"),
     bigquery.SchemaField("Product_Type_Group", "STRING"),
     bigquery.SchemaField("Primary_Package_Group", "STRING"),
     bigquery.SchemaField("Aisle_Name", "STRING"),
     bigquery.SchemaField("TGT_DPCI_Value", "STRING"),
     bigquery.SchemaField("TGT_Class_Value", "STRING"),
     bigquery.SchemaField("TGT_Subclass_Value", "STRING"),
     bigquery.SchemaField("TGT_Major_Brand_Value", "STRING"),
     bigquery.SchemaField("TGT_All_Brands_Value", "STRING"),
     bigquery.SchemaField("Dollar_Sales", "FLOAT")
]

with open(localfilename, "rb") as source_file:
    job = client.load_table_from_file(source_file, table_ref, job_config=job_config)

job.result()

print("Loaded {} rows into {}:{}.".format(job.output_rows, dataset_id, table_id))

The error is raised at the "Dollar_Sales" column, so I assume the header row is not actually being skipped, and the load fails because the header text "Dollar Sales" is a string that cannot be parsed as a number. When I tested turning autodetect on without defining the schema, the header row was still included and every column in the table came out as a string. Any ideas why the leading row is not skipped? I'm also confused by "location 3913004" in the error message, as my CSV only has about 39k rows. Thanks.

EDIT: I should mention that the values in the "Dollar Sales" column of the CSV I am loading are numerical, and I need to keep them that way.

You said: "it can't parse the header "Dollar Sales" because it's a string"

However, the program says: "Could not parse 'Dollar Sales' as DOUBLE "

This means it's a data type issue. DOUBLE is a numeric type, and you even declared the field as FLOAT. If it needs to be a STRING, perhaps set it to that instead? I am assuming it is supposed to be a numerical value since it's "Dollar Sales", but keep in mind that programs will fail if you supply one data type where another is expected; you can't add a string to a number, for example. So if it throws an error like this, a value is of the wrong type somewhere, or its destination doesn't accept that type.

EDIT - I just noticed you left off a closing ] bracket. I don't know enough about Python to know whether that matters, or whether you cut it (or other code) out on purpose or by accident, but if it's not a data type issue, perhaps it's a termination issue?
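
One way to narrow this down before loading is to scan the CSV locally for values in that column that will not parse as a number. The sketch below is only an illustration, not part of the BigQuery API: find_unparseable is a hypothetical helper, the sample data is made up, and it assumes that empty fields are acceptable (loaded as NULL). The "location" in the error message appears to be a byte offset into the file rather than a row number, so a hit far from row 1 (for example a header line repeated in the middle of the file, or a footer line) would be consistent with the error even when skip_leading_rows is working.

```python
import csv
import io


def find_unparseable(rows, column_index, skip_leading_rows=1):
    """Return (row_number, value) pairs whose value at column_index is not a float.

    rows is any iterable of lists, such as a csv.reader. Row numbers are 1-based.
    """
    bad = []
    for row_number, row in enumerate(rows, start=1):
        if row_number <= skip_leading_rows:
            continue  # mirror skip_leading_rows: ignore the first N rows
        if len(row) <= column_index:
            bad.append((row_number, None))  # row is too short to hold the column
            continue
        value = row[column_index]
        if value == "":
            continue  # assumption: empty fields are fine and load as NULL
        try:
            float(value)
        except ValueError:
            bad.append((row_number, value))
    return bad


# Made-up sample in which the header line is repeated mid-file.
sample = "Time,Dollar Sales\nw1,10.5\nTime,Dollar Sales\nw2,3\n"
print(find_unparseable(csv.reader(io.StringIO(sample)), column_index=1))
```

For the file in the question, the same helper could be run with open(localfilename, newline="") wrapped in csv.reader and column_index=13, since the error reports Dollar_Sales at position 13. Note that Python's float() is only an approximation of BigQuery's parser: values such as "1,234.50" or "$12.00" are flagged here, and they are also worth checking by hand.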
