
Ignore Last row in CSV file as part of BigQuery External table command

I have about 40 comma-delimited CSV files in GCS. However, the last line of every file contains only a closing quote and a dot:

”.

So the files do not conform to a valid CSV schema, and I have to work around this data quality issue.

My aim is to create an external table that references the GCS files and then query the data.

Example:

create or replace external table dataset.tableName
options (
  uris = ['gs://bucket_path/allCSVFILES_*.csv'],
  format = 'CSV',
  skip_leading_rows = 1,
  ignore_unknown_values = true
)

The external table is created without any error. However, when I select the data, I run into this error:

"error message: CSV table references column position 16, but line starting at position:18628631 contains only 1 columns"

This is caused by the quote and dot (”.) at the end of each file.

My question is: is there any way in BigQuery to consume the data without the LAST LINE? The options include skip_leading_rows to skip the header, but is there any way to skip the last row?

Currently my best option is to clean the files using sed or tail.

I have checked the CREATE OR REPLACE EXTERNAL TABLE options list below and tried ignore_unknown_values, but I don't see any other option that would work.

https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#create_external_table_statement
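The file-cleaning option mentioned in the question (sed/tail) could also be done in Python. The following is only a minimal sketch, assuming each file fits in memory and that the malformed line is always the very last line; the function name drop_last_line is hypothetical and not part of any library.

```python
def drop_last_line(data: bytes) -> bytes:
    """Return the file content without its final line.

    Works on raw bytes so the encoding of the file does not matter.
    Trailing newline characters after the last line are ignored.
    """
    # Ignore any newline characters that follow the last line.
    stripped = data.rstrip(b"\r\n")
    # Find the newline that ends the second-to-last line.
    cut = stripped.rfind(b"\n")
    if cut == -1:
        # Zero or one line in total: nothing remains after dropping it.
        return b""
    # Keep everything up to and including that newline.
    return stripped[: cut + 1]


if __name__ == "__main__":
    # In-memory sample: header, one data row, then the stray quote-and-dot line.
    sample = "id,name\n1,foo\n\u201d.\n".encode("utf-8")
    print(drop_last_line(sample).decode("utf-8"), end="")
```

The cleaned bytes would then have to be written back to GCS (or to a new prefix) before the external table is pointed at them; that step is left out here.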

You can try the workaround below:

I used pandas to remove the last record from the CSV file and then loaded the result into BigQuery.

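import pandas as pd
from google.cloud import bigquery

# Read the CSV file from GCS (reading gs:// paths with pandas needs the gcsfs package).
df = pd.read_csv('gs://samplecsv.csv')

# Drop the last record, i.e. the malformed trailing line.
df.drop(df.tail(1).index, inplace=True)

# Load the cleaned DataFrame into a BigQuery table and wait for the job to finish.
client = bigquery.Client()
dataset_ref = client.dataset('dataset')
table_ref = dataset_ref.table('new_table')
client.load_table_from_dataframe(df, table_ref).result()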

For more information, you can refer to this link, which describes the limitations of loading CSV files into BigQuery.
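The pandas workaround drops the last record unconditionally. A slightly more defensive variant is sketched below, using only the standard library: it drops the final row only when its column count differs from the header's, which matches the "contains only 1 columns" error in the question. The function name rows_without_malformed_tail is hypothetical, and the sketch assumes the file content is already available as a string.

```python
import csv
import io


def rows_without_malformed_tail(text: str) -> list[list[str]]:
    """Parse CSV text and drop the last row if it does not match the header width."""
    # newline="" lets the csv module handle line endings itself.
    reader = csv.reader(io.StringIO(text, newline=""))
    # Skip completely empty lines, which csv.reader returns as empty lists.
    rows = [row for row in reader if row]
    # Compare the final row against the header (first row) by column count.
    if len(rows) > 1 and len(rows[-1]) != len(rows[0]):
        rows.pop()
    return rows


if __name__ == "__main__":
    sample = "id,name\n1,foo\n2,bar\n\u201d.\n"
    for row in rows_without_malformed_tail(sample):
        print(row)
```

A file whose last row is well formed passes through unchanged, so the same function can be run over all of the files without first checking which ones are affected.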
