
Trigger Cloud Storage - Dataflow

I'm just getting started and need some help. I built a custom pipeline with Apache Beam that reads a CSV file from a folder inside a Cloud Storage bucket and writes the data into a BigQuery table, and it already works the way I wanted. But since it is a batch pipeline, it only runs when I launch the Dataflow job myself. I want to automate it so that whenever a new file is uploaded, the job runs by itself. How do I do that?

Dataflow template:

import apache_beam as beam
import os
from apache_beam.options.pipeline_options import PipelineOptions

# With 'template_location' set, running this script stages the pipeline
# as a classic template in Cloud Storage instead of executing it right away.
pipeline_options = {
    'project': 'curso-dataflow-beam-347613',
    'runner': 'DataflowRunner',
    'region': 'southamerica-east1',
    'staging_location': 'gs://curso-apachebeam-dataflow/temp',
    'temp_location': 'gs://curso-apachebeam-dataflow/temp',
    'template_location': 'gs://curso-apachebeam-dataflow/template/storage_to_bigquery',
    'save_main_session': True}

pipeline_options = PipelineOptions.from_dictionary(pipeline_options)
p1 = beam.Pipeline(options=pipeline_options)

# Local service-account key used to authenticate the Dataflow submission
serviceAccount = r'C:\Users\paulo.santos\Documents\CURSO DATA FLOW\Python-master\curso-dataflow-beam-347613-c998cb1e5f49.json'
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = serviceAccount


def criar_dict(record):
    # Map each CSV column to its BigQuery field
    dict_ = {}
    dict_['name'] = record[0]
    dict_['company'] = record[1]
    dict_['pin'] = record[2]
    return dict_

table_schema = 'name:STRING, company:STRING, pin:INTEGER'
tabela = 'projeto_curso_dataflow.Curso_dataflow_projeto'

Tabela_Dados = (
    p1
    | "Importar Dados" >> beam.io.ReadFromText(r"gs://curso-apachebeam-dataflow/entrada/dataset.csv", skip_header_lines = 2)
    | "Separar por Vírgulas" >> beam.Map(lambda record: record.split(','))
    | "Criar um dic" >> beam.Map(lambda record: criar_dict(record)) 
    | "Gravar no BigQuery" >> beam.io.WriteToBigQuery(
                              tabela,
                              schema=table_schema,
                              write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                              create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                              custom_gcs_temp_location = 'gs://curso-apachebeam-dataflow/temp' )
)

p1.run()

If you expect a high throughput of file writes, you can use the Google-provided Dataflow template "Text Files on Cloud Storage to BigQuery (Stream)".

This will create a streaming pipeline.

Here is an example: https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming#text-files-on-cloud-storage-to-bigquery-stream (a rough sketch of launching it programmatically is shown below).
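
For reference, a provided classic template like that one can also be launched programmatically through the Dataflow REST API (google-api-python-client). The snippet below is only a rough sketch under assumptions: it uses the public template path gs://dataflow-templates/latest/Stream_GCS_Text_to_BigQuery, and the schema file and JavaScript UDF paths are placeholders (this template expects both to exist in your bucket).

from googleapiclient.discovery import build

project = 'curso-dataflow-beam-347613'
region = 'southamerica-east1'

# Launch the provided streaming template "Text Files on Cloud Storage to BigQuery (Stream)"
dataflow = build('dataflow', 'v1b3')
request = dataflow.projects().locations().templates().launch(
    projectId=project,
    location=region,
    gcsPath='gs://dataflow-templates/latest/Stream_GCS_Text_to_BigQuery',
    body={
        'jobName': 'storage-to-bigquery-stream',
        'parameters': {
            # Placeholder paths: adjust to your bucket layout
            'inputFilePattern': 'gs://curso-apachebeam-dataflow/entrada/*.csv',
            'JSONPath': 'gs://curso-apachebeam-dataflow/schema/schema.json',
            'outputTable': 'curso-dataflow-beam-347613:projeto_curso_dataflow.Curso_dataflow_projeto',
            'javascriptTextTransformGcsPath': 'gs://curso-apachebeam-dataflow/udf/transform.js',
            'javascriptTextTransformFunctionName': 'transform',
            'bigQueryLoadingTemporaryDirectory': 'gs://curso-apachebeam-dataflow/temp',
        },
    },
)
print(request.execute())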

But if files arrive in your bucket only occasionally (at wide time intervals), an approach based on Cloud Functions will be a better fit.

Here is a good step-by-step guide: https://medium.com/@aishwarya.gupta3/cloud-function-to-start-a-data-flow-job-on-a-new-file-upload-in-google-cloud-storage-using-trigger-30270b31a06d (a sketch of the idea is shown below).
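
As a rough sketch of that approach (not the article's exact code): a 1st-gen Cloud Function with a google.storage.object.finalize trigger on the bucket can launch the classic template you staged at gs://curso-apachebeam-dataflow/template/storage_to_bigquery. The 'input' runtime parameter below is an assumption: it only works if the pipeline reads its input path from a ValueProvider argument instead of the hard-coded CSV path in the question.

from googleapiclient.discovery import build

PROJECT = 'curso-dataflow-beam-347613'
REGION = 'southamerica-east1'
TEMPLATE = 'gs://curso-apachebeam-dataflow/template/storage_to_bigquery'

def trigger_dataflow(event, context):
    # Background Cloud Function: 'event' describes the object that was uploaded
    file_path = 'gs://{}/{}'.format(event['bucket'], event['name'])

    dataflow = build('dataflow', 'v1b3')
    request = dataflow.projects().locations().templates().launch(
        projectId=PROJECT,
        location=REGION,
        gcsPath=TEMPLATE,
        body={
            'jobName': 'storage-to-bigquery-triggered',
            # Hypothetical runtime parameter: requires the pipeline to expose
            # an 'input' ValueProvider argument rather than a fixed path
            'parameters': {'input': file_path},
        },
    )
    response = request.execute()
    print('Launched Dataflow job for {}: {}'.format(file_path, response))

Deploying it with a finalize trigger on the bucket (roughly: gcloud functions deploy trigger_dataflow --runtime python39 --trigger-resource curso-apachebeam-dataflow --trigger-event google.storage.object.finalize) makes the job start every time a new file finishes uploading; remember to add google-api-python-client to the function's requirements.txt.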

See ya.
