
Rewriting a CSV file from Cloud Storage with a Cloud Function and sending it to BigQuery

I'm working on a small Cloud Function Python script to rewrite a CSV file (skipping some columns) coming from Cloud Storage and send it to BigQuery.

The BigQuery part of my script is like this:

def bq_import(request):
    job_config.skip_leading_rows = 1
    # The source format defaults to CSV, so the line below is optional.
    job_config.source_format = bigquery.SourceFormat.CSV
    uri = "gs://url.appspot.com/fil.csv"

    load_job = bq_client.load_table_from_uri(
        uri, dataset_ref.table('table'), job_config=job_config
    )  # API request

    load_job.result()  # Waits for table load to complete.

    destination_table = bq_client.get_table(dataset_ref.table('table'))

I found this script which allows me to rewrite the csv by skipping some columns:

def remove_csv_columns(input_csv, output_csv, exclude_column_indices):
    with open(input_csv) as file_in, open(output_csv, 'w') as file_out:
        reader = csv.reader(file_in)
        writer = csv.writer(file_out)
        writer.writerows(
            [col for idx, col in enumerate(row)
             if idx not in exclude_column_indices]
            for row in reader)

remove_csv_columns('in.csv', 'out.csv', (3, 4))

So I basically need to make these two scripts work together within my Cloud Function. However, I'm not sure how I should handle the remove_csv_columns function, and especially the output_csv variable. Should I create an empty virtual CSV file? Or an array, or something like that? How can I rewrite this CSV file on the fly?

I think that my final script should look like this, but something is missing...

uri = "gs://url.appspot.com/fil.csv"

def remove_csv_columns(uri, output_csv, exclude_column_indices):
    with open(input_csv) as file_in, open(output_csv, 'w') as file_out:
        reader = csv.reader(file_in)
        writer = csv.writer(file_out)
        writer.writerows(
            [col for idx, col in enumerate(row)
             if idx not in exclude_column_indices]
            for row in reader)

def bq_import(request):
    job_config.skip_leading_rows = 1
    # The source format defaults to CSV, so the line below is optional.
    job_config.source_format = bigquery.SourceFormat.CSV
    csv_file = remove_csv_columns('in.csv', 'out.csv', (3, 4))

    load_job = bq_client.load_table_from_uri(
        csv_file, dataset_ref.table('table'), job_config=job_config
    )  # API request

    load_job.result()  # Waits for table load to complete.

    destination_table = bq_client.get_table(dataset_ref.table('table'))

Basically, I think that I need to use the output of remove_csv_columns to define my CSV file in the bq_import function, but I'm not sure how.

By the way, I'm learning Python and I'm not an expert developer. Thanks.

Your code is quite buggy; I will try to be clear in my corrections.

uri = "gs://url.appspot.com/fil.csv"

I don't know how your function is triggered, but generally the file to process is contained in the request object, as in this example of an event coming from GCS. Use bucket and name to build your uri dynamically.
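
For illustration, here is a minimal sketch assuming a background function triggered directly by a Cloud Storage event (the function name gcs_trigger and the processing steps are placeholders):

def gcs_trigger(event, context):
    # The GCS event payload contains the bucket and the object name
    bucket_name = event['bucket']
    file_name = event['name']
    # Build the uri dynamically instead of hard-coding it
    uri = "gs://{}/{}".format(bucket_name, file_name)
    # ... download, rewrite and load the file from here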

def remove_csv_columns(uri, output_csv, exclude_column_indices):
    with open(input_csv) as file_in, open(output_csv, 'w') as file_out:

Be careful: you use uri as the function parameter name, but you use input_csv to open your input file in read mode. Your code crashes here because input_csv doesn't exist!

Another remark here: uri is the function parameter name, known only inside the function, with no relation to the outside except through the caller, which fills in this value. It has absolutely no link with the global variable uri = "gs://url.appspot.com/fil.csv" that you defined before.

        reader = csv.reader(file_in)
        writer = csv.writer(file_out)
        writer.writerows(
            [col for idx, col in enumerate(row)
             if idx not in exclude_column_indices]
            for row in reader)

def bq_import(request):
    job_config.skip_leading_rows = 1
    # The source format defaults to CSV, so the line below is optional.
    job_config.source_format = bigquery.SourceFormat.CSV
    csv_file = remove_csv_columns('in.csv', 'out.csv', (3, 4))

Your input file name is static here. Read my remark above about building the uri dynamically.

Look at the remove_csv_columns function: it returns nothing, it simply writes a new file to out.csv. Thus, your csv_file variable represents nothing here. In addition, what does this function do? It reads the in.csv file and writes the out.csv file (dropping the unwanted columns). You have to pass file paths to this function.
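
In other words, the function is used for its side effect, not its return value; after calling it, the path of the output file is what you work with (paths here match your own example call):

remove_csv_columns('in.csv', 'out.csv', (3, 4))
# the reduced CSV is now in out.csv; that file is what must be loaded into BigQuery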

By the way, you have to download the file from Cloud Storage and store it locally. In Cloud Functions, only the /tmp directory is writable, so your code should look like this:

from google.cloud import storage

# create the storage client
storage_client = storage.Client()
# get the bucket by name
bucket = storage_client.get_bucket('<your bucket>')
# get the bucket data as a blob
blob = bucket.get_blob('<your full file name, path included>')
# download the content (download_as_string() returns bytes)
data = blob.download_as_string()
# write the file locally, in binary mode because data is bytes
with open('/tmp/input.csv', 'wb') as file_out:
    file_out.write(data)
remove_csv_columns('/tmp/input.csv', '/tmp/out.csv', (3, 4))
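
As a side note, the download can also be written more compactly by letting the client library write the blob straight to a local file (an equivalent alternative):

# shorter equivalent of the download-and-write step above
blob.download_to_filename('/tmp/input.csv')
remove_csv_columns('/tmp/input.csv', '/tmp/out.csv', (3, 4))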

Let's continue with your code:

    load_job = bq_client.load_table_from_uri(
        csv_file, dataset_ref.table('table'), job_config=job_config
    )  # API request

The load_table_from_uri function loads a file into BigQuery from a file already present in Cloud Storage. That's not your target here: you want to load the out.csv file created locally inside the function. The right call is load_job = bq_client.load_table_from_file(open('/tmp/out.csv', 'rb'), dataset_ref.table('table'), job_config=job_config) (the destination table is still required as the second argument).

    load_job.result()  # Waits for table load to complete.

    destination_table = bq_client.get_table(dataset_ref.table('table'))

Then, remember to clean the /tmp directory to free the memory, be careful with the Cloud Function timeout, add the correct libraries to your requirements.txt file (at least the Cloud Storage and BigQuery dependencies), and finally take care of the roles granted to your Cloud Function.
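
Putting the pieces together, the whole corrected function could look roughly like this (only a sketch: the bucket, dataset and file names are placeholders to adapt, and it assumes the remove_csv_columns function shown above):

import os
from google.cloud import bigquery, storage

def bq_import(request):
    storage_client = storage.Client()
    bq_client = bigquery.Client()
    dataset_ref = bq_client.dataset('<your dataset>')

    # 1. Download the source file from Cloud Storage to /tmp
    bucket = storage_client.get_bucket('<your bucket>')
    blob = bucket.get_blob('<your full file name, path included>')
    blob.download_to_filename('/tmp/input.csv')

    # 2. Rewrite it locally, dropping the unwanted columns
    remove_csv_columns('/tmp/input.csv', '/tmp/out.csv', (3, 4))

    # 3. Load the local file into BigQuery
    job_config = bigquery.LoadJobConfig()
    job_config.skip_leading_rows = 1
    job_config.source_format = bigquery.SourceFormat.CSV
    with open('/tmp/out.csv', 'rb') as source_file:
        load_job = bq_client.load_table_from_file(
            source_file, dataset_ref.table('table'), job_config=job_config)
    load_job.result()  # Waits for the load job to complete.

    # 4. Clean up /tmp, since it counts against the function's memory
    os.remove('/tmp/input.csv')
    os.remove('/tmp/out.csv')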


However, this was only for improving your Python code and skills. In any case, this function is useless.

Indeed, with Cloud Functions, as said before, you can only write to the /tmp directory. It's an in-memory file system, and a Cloud Function is limited to 2 GB of memory (files and execution code footprint included). Because of that, your input file can't be bigger than about 800 MB, and for small files there is an easier way:

  • Load your raw file into a temporary table in BigQuery
  • Perform an INSERT ... SELECT BigQuery query that keeps only the columns you want, like this:
INSERT INTO `<dataset>.<table>`
SELECT * EXCEPT (<column to ignore>) FROM `<dataset>.<temporary table>`
  • Delete your temporary table (see the sketch after this list)
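
A minimal sketch of that approach with the BigQuery Python client (the dataset and table names are placeholders, temp_table stands for your temporary table, and an expiration on the temporary table would work instead of the explicit delete):

from google.cloud import bigquery

bq_client = bigquery.Client()
dataset_ref = bq_client.dataset('<dataset>')
uri = "gs://url.appspot.com/fil.csv"

# 1. Load the raw file as-is into a temporary table
job_config = bigquery.LoadJobConfig()
job_config.skip_leading_rows = 1
job_config.source_format = bigquery.SourceFormat.CSV
job_config.autodetect = True  # let BigQuery infer the raw schema
bq_client.load_table_from_uri(
    uri, dataset_ref.table('temp_table'), job_config=job_config).result()

# 2. Copy only the columns you want into the final table
query = """
    INSERT INTO `<dataset>.<table>`
    SELECT * EXCEPT (<column to ignore>) FROM `<dataset>.temp_table`
"""
bq_client.query(query).result()

# 3. Drop the temporary table
bq_client.delete_table(dataset_ref.table('temp_table'))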

Because your file is small (less than 1 GB) and because the BigQuery free tier gives you 1 TB of data scanned per month (you pay only for the data scanned, not for the processing, so do all the SQL transformations that you want for free), it's easier to handle the data in BigQuery than in Python.

The function's processing time will be longer, and you could end up paying for processing time on the Cloud Function, whereas on BigQuery you would not.
